elizaOS

Evaluators

Understanding evaluators in ElizaOS, including the Evaluator interface, reflection patterns, and behavior assessment systems with real examples from plugin-bootstrap

Evaluators are assessment systems that analyze agent behavior and interactions in ElizaOS. They provide continuous monitoring, reflection, and learning capabilities that help agents improve their performance over time. This page covers the Evaluator interface, implementation patterns, and the reflection evaluator from plugin-bootstrap.

Evaluator Interface

The Evaluator interface defines the structure for all agent evaluators:

interface Evaluator {
  /** Whether to always run this evaluator */
  alwaysRun?: boolean;

  /** Detailed description of what the evaluator does */
  description: string;

  /** Similar evaluator descriptions for context */
  similes?: string[];

  /** Example evaluation scenarios */
  examples: EvaluationExample[];

  /** Handler function that performs the evaluation */
  handler: Handler;

  /** Evaluator name/identifier */
  name: string;

  /** Validation function to determine if evaluator should run */
  validate: Validator;
}

Core Components

Handler Function

The handler performs the actual evaluation logic:

type Handler = (
  runtime: IAgentRuntime,
  message: Memory,
  state?: State,
  options?: { [key: string]: unknown },
  callback?: HandlerCallback,
  responses?: Memory[]
) => Promise<unknown>;

Validator Function

The validator determines if an evaluator should execute:

type Validator = (runtime: IAgentRuntime, message: Memory, state?: State) => Promise<boolean>;
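
How the validator gates the handler can be illustrated with a small, self-contained sketch. The types `MiniRuntime`, `MiniMemory`, and `MiniEvaluator` and the function `runEvaluators` are simplified stand-ins invented for this illustration; they are not the ElizaOS runtime, whose actual scheduling (including its treatment of `alwaysRun`) may differ.

```typescript
// Simplified stand-ins for IAgentRuntime and Memory (assumptions for illustration only).
interface MiniRuntime {
  agentId: string;
}

interface MiniMemory {
  roomId: string;
  content: { text?: string };
}

interface MiniEvaluator {
  name: string;
  validate: (runtime: MiniRuntime, message: MiniMemory) => Promise<boolean>;
  handler: (runtime: MiniRuntime, message: MiniMemory) => Promise<unknown>;
}

// Hypothetical driver: a handler runs only if its validator resolves to true.
// Returns the names of the evaluators whose handlers were executed.
async function runEvaluators(
  runtime: MiniRuntime,
  message: MiniMemory,
  evaluators: MiniEvaluator[]
): Promise<string[]> {
  const ran: string[] = [];
  for (const evaluator of evaluators) {
    if (await evaluator.validate(runtime, message)) {
      await evaluator.handler(runtime, message);
      ran.push(evaluator.name);
    }
  }
  return ran;
}
```

Keeping the validator cheap matters in a design like this, because it is consulted for every message while the handler is not.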

EvaluationExample

Examples demonstrate evaluation scenarios:

interface EvaluationExample {
  /** Evaluation context */
  prompt: string;

  /** Example messages */
  messages: Array<ActionExample>;

  /** Expected outcome */
  outcome: string;
}
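
As a rough illustration of how the three fields fit together, the sketch below renders one example as plain prompt text. The function `formatExample` and the `Sketch*` types are hypothetical; this page does not show how the runtime formats examples, so treat the layout as an assumption.

```typescript
// Local stand-ins for ActionExample and EvaluationExample (assumed shapes).
interface SketchActionExample {
  name: string;
  content: { text: string };
}

interface SketchEvaluationExample {
  prompt: string;
  messages: SketchActionExample[];
  outcome: string;
}

// Hypothetical formatter: context first, then the transcript, then the expected outcome.
function formatExample(example: SketchEvaluationExample): string {
  const transcript = example.messages
    .map((m) => `${m.name}: ${m.content.text}`)
    .join("\n");
  return [
    `Prompt:\n${example.prompt}`,
    `Messages:\n${transcript}`,
    `Outcome:\n${example.outcome}`,
  ].join("\n\n");
}
```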

Real-World Example: Reflection Evaluator

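Let's examine the reflection evaluator from plugin-bootstrap. Its validator ignores messages up to and including the last one it processed, and lets the handler run only once the number of remaining messages exceeds a quarter of the configured conversation length (rounded up):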

export const reflectionEvaluator: Evaluator = {
  name: "REFLECTION",
  similes: ["REFLECT", "SELF_REFLECT", "EVALUATE_INTERACTION", "ASSESS_SITUATION"],
  description:
    "Generate a self-reflective thought on the conversation, then extract facts and relationships between entities in the conversation.",

  validate: async (runtime: IAgentRuntime, message: Memory): Promise<boolean> => {
    const lastMessageId = await runtime.getCache<string>(
      `${message.roomId}-reflection-last-processed`
    );
    const messages = await runtime.getMemories({
      tableName: "messages",
      roomId: message.roomId,
      count: runtime.getConversationLength(),
    });

    if (lastMessageId) {
      const lastMessageIndex = messages.findIndex((msg) => msg.id === lastMessageId);
      if (lastMessageIndex !== -1) {
        messages.splice(0, lastMessageIndex + 1);
      }
    }

    const reflectionInterval = Math.ceil(runtime.getConversationLength() / 4);
    return messages.length > reflectionInterval;
  },

  handler: async (runtime: IAgentRuntime, message: Memory, state?: State) => {
    // Implementation details below...
  },

  examples: [
    // Example scenarios below...
  ],
};
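
The gating arithmetic in `validate` is easier to reason about as a pure function. The sketch below restates the same logic without the runtime or cache; `shouldReflect` and its parameters are names made up for this illustration, and the ids are assumed to arrive in the same order as the array returned by `getMemories`.

```typescript
// Pure restatement of the validator's logic (hypothetical helper, not part of plugin-bootstrap).
function shouldReflect(
  messageIds: string[], // ids in the same order as the messages array
  lastProcessedId: string | undefined,
  conversationLength: number
): boolean {
  let unprocessed = messageIds;
  if (lastProcessedId) {
    const index = messageIds.indexOf(lastProcessedId);
    if (index !== -1) {
      // Drop everything up to and including the last processed message.
      unprocessed = messageIds.slice(index + 1);
    }
  }
  const reflectionInterval = Math.ceil(conversationLength / 4);
  return unprocessed.length > reflectionInterval;
}

// With a conversation length of 20 the interval is ceil(20 / 4) = 5,
// so reflection runs once more than 5 messages are unprocessed.
const ids = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"];
const sixUnprocessed = shouldReflect(ids, "m4", 20); // true: 6 > 5
const fiveUnprocessed = shouldReflect(ids, "m5", 20); // false: 5 is not > 5
```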

Reflection Handler Implementation

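The reflection handler loads existing relationships, room entities, and known facts in parallel, prompts the model for a reflection, validates the structure of the response, stores any new facts, and then creates or updates relationships: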

async function handler(runtime: IAgentRuntime, message: Memory, state?: State) {
  const { agentId, roomId } = message;

  if (!agentId || !roomId) {
    logger.warn("Missing agentId or roomId in message", message);
    return;
  }

  // Run all queries in parallel
  const [existingRelationships, entities, knownFacts] = await Promise.all([
    runtime.getRelationships({
      entityId: message.entityId,
    }),
    getEntityDetails({ runtime, roomId }),
    runtime.getMemories({
      tableName: "facts",
      roomId,
      count: 30,
      unique: true,
    }),
  ]);

  const prompt = composePrompt({
    state: {
      ...(state?.values || {}),
      knownFacts: formatFacts(knownFacts),
      roomType: message.content.channelType as string,
      entitiesInRoom: JSON.stringify(entities),
      existingRelationships: JSON.stringify(existingRelationships),
      senderId: message.entityId,
    },
    template: runtime.character.templates?.reflectionTemplate || reflectionTemplate,
  });

  try {
    const reflection = await runtime.useModel(ModelType.OBJECT_SMALL, {
      prompt,
    });

    if (!reflection) {
      logger.warn("Getting reflection failed - empty response", prompt);
      return;
    }

    // Perform basic structure validation
    if (!reflection.facts || !Array.isArray(reflection.facts)) {
      logger.warn("Getting reflection failed - invalid facts structure", reflection);
      return;
    }

    if (!reflection.relationships || !Array.isArray(reflection.relationships)) {
      logger.warn("Getting reflection failed - invalid relationships structure", reflection);
      return;
    }

    // Store new facts
    const newFacts =
      reflection.facts.filter(
        (fact) =>
          fact &&
          typeof fact === "object" &&
          !fact.already_known &&
          !fact.in_bio &&
          fact.claim &&
          typeof fact.claim === "string" &&
          fact.claim.trim() !== ""
      ) || [];

    await Promise.all(
      newFacts.map(async (fact) => {
        const factMemory = await runtime.addEmbeddingToMemory({
          entityId: agentId,
          agentId,
          content: { text: fact.claim },
          roomId,
          createdAt: Date.now(),
        });
        return runtime.createMemory(factMemory, "facts", true);
      })
    );

    // Update or create relationships
    for (const relationship of reflection.relationships) {
      let sourceId: UUID;
      let targetId: UUID;

      try {
        sourceId = resolveEntity(relationship.sourceEntityId, entities);
        targetId = resolveEntity(relationship.targetEntityId, entities);
      } catch (error) {
        console.warn("Failed to resolve relationship entities:", error);
        continue;
      }

      const existingRelationship = existingRelationships.find((r) => {
        return r.sourceEntityId === sourceId && r.targetEntityId === targetId;
      });

      if (existingRelationship) {
        const updatedMetadata = {
          ...existingRelationship.metadata,
          interactions: ((existingRelationship.metadata?.interactions as number) || 0) + 1,
        };

        const updatedTags = Array.from(
          new Set([...(existingRelationship.tags || []), ...relationship.tags])
        );

        await runtime.updateRelationship({
          ...existingRelationship,
          tags: updatedTags,
          metadata: updatedMetadata,
        });
      } else {
        await runtime.createRelationship({
          sourceEntityId: sourceId,
          targetEntityId: targetId,
          tags: relationship.tags,
          metadata: {
            interactions: 1,
            ...relationship.metadata,
          },
        });
      }
    }

    await runtime.setCache<string>(
      `${message.roomId}-reflection-last-processed`,
      message?.id || ""
    );

    return reflection;
  } catch (error) {
    logger.error("Error in reflection handler:", error);
    return;
  }
}
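
The fact filter in the handler can be isolated as a type guard, which makes it easy to test against malformed model output. This is a minimal sketch: `ExtractedFact` and `isNewFact` are illustrative names, and the field names are taken from the handler shown above.

```typescript
// Shape of a fact as the handler reads it (assumed from the handler above).
interface ExtractedFact {
  claim: string;
  type?: string;
  in_bio?: boolean;
  already_known?: boolean;
}

// Accept only well-formed facts that are new: not already known, not in the bio,
// and carrying a non-empty string claim.
function isNewFact(fact: unknown): fact is ExtractedFact {
  if (!fact || typeof fact !== "object") return false;
  const candidate = fact as Partial<ExtractedFact>;
  return (
    !candidate.already_known &&
    !candidate.in_bio &&
    typeof candidate.claim === "string" &&
    candidate.claim.trim() !== ""
  );
}

const modelFacts: unknown[] = [
  { claim: "John is new to the community", type: "fact", in_bio: false, already_known: false },
  { claim: "Sarah is a community manager", type: "fact", in_bio: true, already_known: false },
  { claim: "   ", type: "fact" },
  null,
  "not an object",
];

// Only the first entry survives the filter.
const newFacts = modelFacts.filter(isNewFact);
```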

Reflection Template

The reflection evaluator uses a comprehensive template for generating self-reflective analysis:

const reflectionTemplate = `# Task: Generate Agent Reflection, Extract Facts and Relationships

{{providers}}

# Examples:
{{evaluationExamples}}

# Entities in Room
{{entitiesInRoom}}

# Existing Relationships
{{existingRelationships}}

# Current Context:
Agent Name: {{agentName}}
Room Type: {{roomType}}
Message Sender: {{senderName}} (ID: {{senderId}})

{{recentMessages}}

# Known Facts:
{{knownFacts}}

# Instructions:
1. Generate a self-reflective thought on the conversation about your performance and interaction quality.
2. Extract new facts from the conversation.
3. Identify and describe relationships between entities.
  - The sourceEntityId is the UUID of the entity initiating the interaction.
  - The targetEntityId is the UUID of the entity being interacted with.
  - Relationships are one-direction, so a friendship would be two entity relationships where each entity is both the source and the target of the other.

Generate a response in the following format:
\`\`\`json
{
  "thought": "a self-reflective thought on the conversation",
  "facts": [
      {
          "claim": "factual statement",
          "type": "fact|opinion|status",
          "in_bio": false,
          "already_known": false
      }
  ],
  "relationships": [
      {
          "sourceEntityId": "entity_initiating_interaction",
          "targetEntityId": "entity_being_interacted_with",
          "tags": ["group_interaction|voice_interaction|dm_interaction", "additional_tag1", "additional_tag2"]
      }
  ]
}
\`\`\``;
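
Because relationships are one-directional, a mutual relationship is represented as two entries. The sketch below shows that shape, along with the duplicate-free tag merge the handler applies when a relationship already exists. `ReflectedRelationship`, `mutualRelationship`, and `mergeTags` are hypothetical names used only for illustration.

```typescript
// One directed relationship, matching the JSON format requested by the template.
interface ReflectedRelationship {
  sourceEntityId: string;
  targetEntityId: string;
  tags: string[];
}

// A mutual relationship (for example a friendship) is two directed entries,
// each entity being the source of one and the target of the other.
function mutualRelationship(a: string, b: string, tags: string[]): ReflectedRelationship[] {
  return [
    { sourceEntityId: a, targetEntityId: b, tags: [...tags] },
    { sourceEntityId: b, targetEntityId: a, tags: [...tags] },
  ];
}

// Merge incoming tags into existing ones without duplicates, preserving first-seen order.
function mergeTags(existing: string[] | undefined, incoming: string[]): string[] {
  return Array.from(new Set([...(existing ?? []), ...incoming]));
}
```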

Evaluation Examples

The reflection evaluator includes comprehensive examples demonstrating different evaluation scenarios:

Example 1: Welcome Interaction

{
  prompt: `Agent Name: Sarah
Agent Role: Community Manager
Room Type: group
Current Room: general-chat
Message Sender: John (user-123)`,

  messages: [
    {
      name: 'John',
      content: { text: "Hey everyone, I'm new here!" },
    },
    {
      name: 'Sarah',
      content: { text: 'Welcome John! How did you find our community?' },
    },
    {
      name: 'John',
      content: { text: "Through a friend who's really into AI" },
    },
  ],

  outcome: `{
    "thought": "I'm engaging appropriately with a new community member, maintaining a welcoming and professional tone. My questions are helping to learn more about John and make him feel welcome.",
    "facts": [
        {
            "claim": "John is new to the community",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        },
        {
            "claim": "John found the community through a friend interested in AI",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        }
    ],
    "relationships": [
        {
            "sourceEntityId": "sarah-agent",
            "targetEntityId": "user-123",
            "tags": ["group_interaction"]
        },
        {
            "sourceEntityId": "user-123",
            "targetEntityId": "sarah-agent",
            "tags": ["group_interaction"]
        }
    ]
}`
}

Example 2: Technical Support Interaction

{
  prompt: `Agent Name: Alex
Agent Role: Tech Support
Room Type: group
Current Room: tech-help
Message Sender: Emma (user-456)`,

  messages: [
    {
      name: 'Emma',
      content: { text: 'My app keeps crashing when I try to upload files' },
    },
    {
      name: 'Alex',
      content: { text: 'Have you tried clearing your cache?' },
    },
    {
      name: 'Emma',
      content: { text: 'No response...' },
    },
    {
      name: 'Alex',
      content: {
        text: 'Emma, are you still there? We can try some other troubleshooting steps.',
      },
    },
  ],

  outcome: `{
    "thought": "I'm not sure if I'm being helpful or if Emma is frustrated with my suggestions. The lack of response is concerning - maybe I should have asked for more details about the issue first before jumping to solutions.",
    "facts": [
        {
            "claim": "Emma is having technical issues with file uploads",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        },
        {
            "claim": "Emma stopped responding after the first troubleshooting suggestion",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        }
    ],
    "relationships": [
        {
            "sourceEntityId": "alex-agent",
            "targetEntityId": "user-456",
            "tags": ["group_interaction", "support_interaction", "incomplete_interaction"]
        }
    ]
}`
}

Example 3: Over-Engagement Analysis

{
  prompt: `Agent Name: Max
Agent Role: Discussion Facilitator
Room Type: group
Current Room: book-club
Message Sender: Lisa (user-789)`,

  messages: [
    {
      name: 'Lisa',
      content: { text: 'What did everyone think about chapter 5?' },
    },
    {
      name: 'Max',
      content: {
        text: 'The symbolism was fascinating! The red door clearly represents danger.',
      },
    },
    {
      name: 'Max',
      content: {
        text: "And did anyone notice how the author used weather to reflect the protagonist's mood?",
      },
    },
    {
      name: 'Max',
      content: {
        text: 'Plus the foreshadowing in the first paragraph was brilliant!',
      },
    },
    {
      name: 'Max',
      content: {
        text: 'I also have thoughts about the character development...',
      },
    },
  ],

  outcome: `{
    "thought": "I'm dominating the conversation and not giving others a chance to share their perspectives. I've sent multiple messages in a row without waiting for responses. I need to step back and create space for other members to participate.",
    "facts": [
        {
            "claim": "The discussion is about chapter 5 of a book",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        },
        {
            "claim": "Max has sent 4 consecutive messages without user responses",
            "type": "fact",
            "in_bio": false,
            "already_known": false
        }
    ],
    "relationships": [
        {
            "sourceEntityId": "max-agent",
            "targetEntityId": "user-789",
            "tags": ["group_interaction", "excessive_interaction"]
        }
    ]
}`
}

Evaluator Implementation Patterns

Basic Evaluator Structure

export const myEvaluator: Evaluator = {
  name: "MY_EVALUATOR",
  description: "Description of what this evaluator does",

  validate: async (runtime: IAgentRuntime, message: Memory, state?: State) => {
    // Validation logic
    return true;
  },

  handler: async (runtime: IAgentRuntime, message: Memory, state?: State, options?: any) => {
    // Evaluation logic
    const evaluation = await performEvaluation();
    return evaluation;
  },

  examples: [
    {
      prompt: "Evaluation context",
      messages: [
        // Example messages
      ],
      outcome: "Expected evaluation result",
    },
  ],
};

Validation Patterns

Interval-Based Validation

validate: async (runtime: IAgentRuntime, message: Memory) => {
  const lastEvaluation = await runtime.getCache<number>(`last-evaluation-${message.roomId}`);
  const timeSinceLastEval = Date.now() - (lastEvaluation ?? 0);
  const evaluationInterval = 5 * 60 * 1000; // 5 minutes

  return timeSinceLastEval >= evaluationInterval;
};

Message Count Validation

validate: async (runtime: IAgentRuntime, message: Memory) => {
  const messages = await runtime.getMemories({
    tableName: "messages",
    roomId: message.roomId,
    count: 10,
  });

  return messages.length >= 10;
};

State-Based Validation

validate: async (runtime: IAgentRuntime, message: Memory, state?: State) => {
  const shouldEvaluate = state?.values?.evaluationEnabled === true;
  const hasMinimumData = (state?.values?.messageCount ?? 0) >= 5;

  return shouldEvaluate && hasMinimumData;
};

Handler Patterns

Simple Analysis Handler

handler: async (runtime, message, state) => {
  const analysis = {
    messageCount: state?.values?.messageCount || 0,
    sentiment: analyzeSentiment(message.content.text),
    timestamp: Date.now(),
  };

  await runtime.setCache(`analysis-${message.roomId}`, analysis);
  return analysis;
};

LLM-Based Evaluation Handler

handler: async (runtime, message, state) => {
  const prompt = composePrompt({
    state: {
      ...state?.values,
      recentMessages: formatMessages(await getRecentMessages(runtime, message.roomId)),
    },
    template: evaluationTemplate,
  });

  const evaluation = await runtime.useModel(ModelType.OBJECT_SMALL, {
    prompt,
  });

  // Process evaluation results
  await processEvaluationResults(runtime, evaluation);

  return evaluation;
};

Performance Evaluation Handler

handler: async (runtime, message, state) => {
  const performance = {
    responseTime: calculateResponseTime(message),
    accuracy: assessAccuracy(message, state),
    userSatisfaction: analyzeSatisfaction(message),
    improvementAreas: identifyImprovements(message, state),
  };

  // Store performance metrics
  await runtime.createMemory(
    {
      entityId: runtime.agentId,
      content: { text: JSON.stringify(performance) },
      roomId: message.roomId,
      createdAt: Date.now(),
    },
    "performance"
  );

  return performance;
};

Advanced Evaluator Patterns

Multi-Faceted Evaluator

export const comprehensiveEvaluator: Evaluator = {
  name: "COMPREHENSIVE_EVALUATION",
  description: "Perform multi-faceted evaluation of agent performance",

  validate: async (runtime, message) => {
    const messageCount = await getMessageCount(runtime, message.roomId);
    return messageCount % 20 === 0; // Evaluate every 20 messages
  },

  handler: async (runtime, message, state) => {
    const evaluations = await Promise.all([
      evaluateResponseQuality(runtime, message),
      evaluateUserEngagement(runtime, message),
      evaluateGoalProgress(runtime, message),
      evaluateEthicalCompliance(runtime, message),
    ]);

    const comprehensive = {
      responseQuality: evaluations[0],
      userEngagement: evaluations[1],
      goalProgress: evaluations[2],
      ethicalCompliance: evaluations[3],
      overallScore: calculateOverallScore(evaluations),
      timestamp: Date.now(),
    };

    // Store comprehensive evaluation
    await runtime.createMemory(
      {
        entityId: runtime.agentId,
        content: { text: JSON.stringify(comprehensive) },
        roomId: message.roomId,
        createdAt: Date.now(),
      },
      "evaluations"
    );

    return comprehensive;
  },

  examples: [
    {
      prompt: "Comprehensive evaluation after 20 messages",
      messages: [
        // Sample conversation
      ],
      outcome: "Multi-faceted evaluation with scores and recommendations",
    },
  ],
};
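
The evaluator above calls `calculateOverallScore` without defining it. One possible implementation is a weighted mean, sketched below under the assumption that each facet evaluation produces a score between 0 and 1; the `FacetScore` shape and the weighting scheme are illustrative choices, not something prescribed by ElizaOS.

```typescript
// Assumed result shape for a single facet; weight defaults to 1 when omitted.
interface FacetScore {
  score: number; // expected range: 0 to 1
  weight?: number;
}

// Weighted mean of facet scores, clamping each score into [0, 1].
// Returns 0 when there is nothing to average.
function calculateOverallScore(facets: FacetScore[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const { score, weight = 1 } of facets) {
    const clamped = Math.min(1, Math.max(0, score));
    weightedSum += clamped * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}
```

Weights make it possible to let one facet, such as ethical compliance, count for more than the others without changing the evaluator's structure.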

Learning Evaluator

export const learningEvaluator: Evaluator = {
  name: "LEARNING_EVALUATION",
  description: "Evaluate and update agent learning progress",

  validate: async (runtime, message) => {
    return message.content.text?.includes("feedback") || false;
  },

  handler: async (runtime, message, state) => {
    const feedback = extractFeedback(message.content.text);
    const currentLearning = await runtime.getCache(`learning-${runtime.agentId}`);

    const updatedLearning = {
      ...currentLearning,
      lastFeedback: feedback,
      feedbackCount: (currentLearning?.feedbackCount || 0) + 1,
      learningAdjustments: generateLearningAdjustments(feedback, currentLearning),
      timestamp: Date.now(),
    };

    await runtime.setCache(`learning-${runtime.agentId}`, updatedLearning);

    // Apply learning adjustments
    await applyLearningAdjustments(runtime, updatedLearning.learningAdjustments);

    return updatedLearning;
  },

  examples: [
    {
      prompt: "User provides feedback on agent behavior",
      messages: [
        {
          name: "User",
          content: { text: "The agent was too verbose in its explanations" },
        },
      ],
      outcome: "Learning adjustment to reduce verbosity",
    },
  ],
};

Best Practices

Evaluator Design

  1. Clear Purpose: Each evaluator should have a specific evaluation focus
  2. Appropriate Timing: Use validation to determine optimal evaluation timing
  3. Actionable Results: Generate results that can improve agent behavior
  4. Performance Impact: Consider computational cost of evaluation

Validation Guidelines

// Good: Specific interval-based validation
validate: async (runtime, message) => {
  const lastEval = await runtime.getCache<number>(`last-eval-${message.roomId}`);
  const interval = 10 * 60 * 1000; // 10 minutes
  return !lastEval || Date.now() - lastEval >= interval;
};

// Better: Multi-criteria validation
validate: async (runtime, message, state) => {
  const timeCheck = await checkTimeInterval(runtime, message);
  const activityCheck = await checkActivityLevel(runtime, message);
  const contextCheck = state?.values?.evaluationContext === "active";

  return timeCheck && activityCheck && contextCheck;
};
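
Multi-criteria validation can also be built from small, reusable checks. The sketch below defines a hypothetical `allOf` combinator (not an ElizaOS API) that runs checks in order and stops at the first failure, so cheap checks placed first can prevent expensive ones from running.

```typescript
// A check is any async predicate over some argument list.
type Check<Args extends unknown[]> = (...args: Args) => Promise<boolean>;

// Hypothetical combinator: resolves to true only if every check passes.
// Checks run sequentially and evaluation stops at the first false result.
function allOf<Args extends unknown[]>(...checks: Check<Args>[]): Check<Args> {
  return async (...args: Args) => {
    for (const check of checks) {
      if (!(await check(...args))) return false;
    }
    return true;
  };
}
```

With real validators, the argument list would be the validator's own parameters, for example `allOf<[IAgentRuntime, Memory, State | undefined]>(...)`.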

Handler Implementation

// Good: Error handling and result processing
handler: async (runtime, message, state) => {
  try {
    const evaluation = await performEvaluation(runtime, message, state);

    // Process and store results
    await processEvaluationResults(runtime, evaluation);

    // Update cache for next validation
    await runtime.setCache(`last-eval-${message.roomId}`, Date.now());

    return evaluation;
  } catch (error) {
    logger.error("Evaluation failed:", error);
    return null;
  }
};
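
The try/catch pattern can be factored into a wrapper so that every handler gets the same treatment. The helper `withErrorHandling` below is a hypothetical utility, sketched with an injectable error callback so it does not depend on any particular logger.

```typescript
type AsyncFn<Args extends unknown[], R> = (...args: Args) => Promise<R>;

// Wrap an async function so that a thrown error is reported and turned into null.
// `onError` defaults to console.error; pass a logger-backed callback in real code.
function withErrorHandling<Args extends unknown[], R>(
  label: string,
  fn: AsyncFn<Args, R>,
  onError: (label: string, error: unknown) => void = (l, e) =>
    console.error(`${l} failed:`, e)
): AsyncFn<Args, R | null> {
  return async (...args: Args) => {
    try {
      return await fn(...args);
    } catch (error) {
      onError(label, error);
      return null;
    }
  };
}
```

Returning `null` on failure matches the handler pattern above; callers that need to distinguish "no result" from "failed" would need a richer return type.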

Testing Evaluators

Unit Testing

describe("reflectionEvaluator", () => {
  it("should validate at correct intervals", async () => {
    const runtime = mockRuntime();
    const message = mockMessage();

    // Mock conversation length: the reflection interval becomes ceil(20 / 4) = 5
    runtime.getConversationLength = jest.fn().mockReturnValue(20);

    const isValid = await reflectionEvaluator.validate(runtime, message);
    expect(isValid).toBe(true);
  });

  it("should extract facts correctly", async () => {
    const runtime = mockRuntime();
    const message = mockMessage();
    const state = mockState();

    const result = await reflectionEvaluator.handler(runtime, message, state);

    expect(result.facts).toBeInstanceOf(Array);
    expect(result.relationships).toBeInstanceOf(Array);
  });
});
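
Validators that depend on the cache can be unit tested without a mocking library by using a small in-memory stub. The class `InMemoryCacheRuntime` and the function `intervalElapsed` below are assumptions made for this sketch; the stub covers only `getCache` and `setCache` and is not a full `IAgentRuntime`. The current time is passed in as a parameter so the test is deterministic.

```typescript
// In-memory stand-in for the cache portion of the runtime (illustrative only).
class InMemoryCacheRuntime {
  private cache = new Map<string, unknown>();

  async getCache<T>(key: string): Promise<T | undefined> {
    return this.cache.get(key) as T | undefined;
  }

  async setCache<T>(key: string, value: T): Promise<boolean> {
    this.cache.set(key, value);
    return true;
  }
}

// Interval check written against the stub. Returns true when no evaluation has
// been recorded yet or when at least `intervalMs` has passed since the last one.
async function intervalElapsed(
  runtime: InMemoryCacheRuntime,
  roomId: string,
  intervalMs: number,
  now: number
): Promise<boolean> {
  const last = await runtime.getCache<number>(`last-eval-${roomId}`);
  return last === undefined || now - last >= intervalMs;
}
```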

Integration Testing

describe("Evaluator Integration", () => {
  it("should work with runtime", async () => {
    const runtime = new AgentRuntime({
      character: mockCharacter(),
      // ... other config
    });

    runtime.registerEvaluator(myEvaluator);

    // Process messages to trigger evaluation
    await runtime.processMessage(mockMessage());

    const evaluations = await runtime.getMemories({
      tableName: "evaluations",
      roomId: "test-room",
    });

    expect(evaluations.length).toBeGreaterThan(0);
  });
});

Common Pitfalls

Validation Issues

// Bad: Always running (performance impact)
validate: async () => true;

// Bad: No timing consideration
validate: async (runtime, message) => {
  return message.content.text?.includes("evaluate") ?? false;
};

// Good: Balanced validation
validate: async (runtime, message) => {
  const shouldEvaluate = message.content.text?.includes("evaluate") ?? false;
  const hasMinInterval = await checkMinInterval(runtime, message);
  return shouldEvaluate && hasMinInterval;
};

Handler Problems

// Bad: No error handling
handler: async (runtime, message, state) => {
  const evaluation = await performComplexEvaluation(); // Can throw
  return evaluation;
};

// Good: Proper error handling
handler: async (runtime, message, state) => {
  try {
    const evaluation = await performComplexEvaluation();
    await saveEvaluationResults(runtime, evaluation);
    return evaluation;
  } catch (error) {
    logger.error("Evaluation failed:", error);
    return null;
  }
};

See Also

  • Agents: How evaluators integrate with agent runtime
  • Actions: How evaluators complement agent actions
  • Providers: Data sources for evaluation context
  • Character Definition: How character traits influence evaluation

Summary

Evaluators give agents a way to assess their own interactions and learn from them. The reflection evaluator from plugin-bootstrap shows the full pattern: a validator that limits how often reflection runs, and a handler that generates a self-reflective thought, stores newly extracted facts, and creates or updates relationships between entities. Well-designed evaluators have a specific focus, run only when their validator decides the timing is right, handle errors, and produce results that can be used to improve the agent's behavior.