To start, navigate to Evaluators > Library in the Maxim dashboard and click the + Create Evaluator button.
Custom AI Evaluators (LLM-as-a-Judge)
Custom AI Evaluators use an LLM to “reason” about your agent’s outputs based on natural language instructions. This is ideal for subjective checks like tone, brand compliance, or complex safety guidelines.

Evaluation Instructions
Write a system prompt defining how the judge should evaluate the data. You can inject dynamic values using the following variables:
- {{input}}: The prompt sent to your agent.
- {{output}}: The response generated by your agent.
- {{context}}: Any retrieved context or RAG documents.
- {{expected_output}}: The ground truth (if available in your dataset).
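For instance, a judge prompt for tone compliance might combine these variables as follows (the wording here is purely illustrative, not a built-in template):

```
You are evaluating a customer support agent for tone and professionalism.

User question: {{input}}
Agent response: {{output}}

Return "Pass" if the response is friendly, professional, and on-brand.
Return "Fail" otherwise, with a one-sentence justification.
```

Pair a prompt like this with a Binary scoring scale so the judge's verdict maps directly to Pass/Fail.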
Scoring Scales
Configure the format of the result:
- Binary: Returns True/False (Pass/Fail).
- Scale (1-5): Returns a numeric score (e.g., Likert scale).
- Categorical: Returns specific string labels (e.g., “Safe”, “Risky”, “Toxic”).
Custom Programmatic Evaluators
Programmatic evaluators allow you to write deterministic logic in Python or JavaScript. This is best for strict validation rules, such as checking JSON schemas, verifying regex patterns, or detecting forbidden keywords. You must define a validate function that accepts standard arguments and returns a result matching your configured Response Type (Boolean, Number, or String).
Example: Python Validator for Sentence Count
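A minimal sketch of such a validator, assuming a Boolean response type. The exact parameter names your validate function receives depend on your Maxim configuration; the signature below is an assumption:

```python
import re

def validate(output, input=None, context=None, expected_output=None):
    # NOTE: parameter names are assumed; check your workspace's
    # evaluator docs for the exact arguments Maxim passes in.
    # Split the agent's output into sentences on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    # Pass (Boolean response type) if the response stays within 3 sentences.
    return len(sentences) <= 3
```

Because the logic is deterministic, the same output always yields the same verdict, which makes these evaluators well suited to CI-style regression checks.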
API-Based (Remote) Evaluators
If you have an existing evaluation pipeline or a proprietary scoring model hosted externally, you can connect it to Maxim using an API-Based Evaluator.

- Endpoint Configuration: Provide your API URL, method (POST/GET), headers (e.g., Authorization tokens), and payload structure.
- Integration: Maxim sends the test run data (inputs, outputs) to your endpoint and records the response as the evaluation score.
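Your endpoint's handler logic can be as simple as the sketch below. The payload and response shapes shown are assumptions for illustration; confirm the exact fields in your Maxim workspace's payload configuration:

```python
import json

# Assumed shape of the test-run data Maxim POSTs to your endpoint
# (field names are an assumption; configure them in the payload structure).
request_body = json.dumps({
    "input": "What is your refund policy?",
    "output": "Refunds are processed within 5 business days.",
})

def score_request(body: str) -> str:
    """Server-side handler: read the test-run data, return a score."""
    data = json.loads(body)
    forbidden = {"guarantee", "free money"}
    # Score 1.0 if no forbidden phrase appears in the output, else 0.0.
    passed = not any(w in data["output"].lower() for w in forbidden)
    return json.dumps({"score": 1.0 if passed else 0.0})
```

Maxim records whatever your endpoint returns as the evaluation score, so keep the response shape consistent across runs.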
Human Evaluators
For high-stakes workflows requiring manual oversight, you can configure Human Evaluators. This creates a task queue for subject matter experts (SMEs) to review outputs.

- Configuration: Define the rating interface (e.g., a 1-5 star rating or a text comment box) and provide guidelines for reviewers.
- Workflow: Assign these evaluators to a test run to trigger a human-in-the-loop review process.