
Concepts

Explore key concepts in AI evaluation, including evaluators, datasets, and custom tools for assessing model performance and output quality.

Evaluators

Evaluators are tools or metrics used to assess the quality, accuracy, and effectiveness of AI model outputs. Maxim offers several types of evaluators, summarized in the table below, that can be customized and integrated into workflows and test runs.

You can find more about evaluators here.

| Evaluator type | Description |
| --- | --- |
| AI | Uses AI models to assess outputs |
| Programmatic | Applies predefined rules or algorithms |
| Statistical | Utilizes statistical methods for evaluation |
| Human | Involves human judgment and feedback |
| API-based | Leverages external APIs for assessment |

Evaluator Store

A large set of pre-built evaluators is available for you to use directly. These can be found in the evaluator store and added to your workspace with a single click.

At Maxim, our pre-built evaluators fall into two categories:

  1. Maxim-created Evaluators: These are evaluators created, benchmarked, and managed by Maxim. There are three kinds of Maxim-created evaluators:

    1. AI Evaluators: These evaluators use other large language models to evaluate your application (LLM-as-a-Judge).
    2. Statistical Evaluators: Traditional ML metrics such as BLEU, ROUGE, WER, and TER.
    3. Programmatic Evaluators: JavaScript functions that help validate your responses, covering common use cases such as validJson and validURL.
  2. Third-party Evaluators: We have also enabled popular third-party libraries for evaluation, such as RAGAS, so you can use them in your evaluation workflows with just a few clicks. If you have any custom integration requests, please feel free to drop us a note.

Within the store, you can search for an evaluator or filter by type tags. Simply click the "Add to workspace" button to make it available for use by your team.

If you want us to build a specific evaluator for your needs, please drop a line at [email protected].

Custom Evaluators

While we provide many evaluators for common use cases out of the box, we understand that some applications have specific requirements. Keeping that in mind, the platform allows for easy creation of custom evaluators of the following types:

AI Evaluators

These evaluators use other LLMs to evaluate your application. You can configure different prompts, models, and scoring strategies depending on your use case. Once tested in the playground, you can start using the evaluators in your workflows.
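
For example, a judge prompt for a custom "Clarity" evaluator could look like the sketch below. The wording, scale, and response format are entirely up to you; the {{input}} and {{output}} variables are reserved placeholders that Maxim resolves at run time (see the list of reserved variables further down).

```text
You are grading the clarity of an assistant's answer.

Question:
{{input}}

Answer:
{{output}}

Rate the clarity of the answer on a scale of 1-5, where 1 means
incomprehensible and 5 means perfectly clear. Return the score along
with a one-sentence reason for your rating.
```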

Programmatic Evaluators

These are JavaScript functions where you can write your own custom logic. You can use the {{input}}, {{output}}, and {{expectedOutput}} variables, which pull the relevant data from the dataset columns or the run's response when the evaluator executes.
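
As a minimal sketch, a programmatic evaluator for the validJson use case could look like the following. The exact function signature Maxim expects may differ; here the resolved {{output}} value is assumed to arrive as a plain string argument, and the return shape follows the score-plus-reasoning convention described below.

```javascript
// Sketch of a programmatic evaluator that checks whether the generated
// output is valid JSON. Signature and return shape are assumptions.
function evaluateValidJson(output) {
  try {
    JSON.parse(output);
    return { score: 1, reasoning: "Output parsed as valid JSON." };
  } catch (err) {
    return { score: 0, reasoning: "Output is not valid JSON: " + err.message };
  }
}

// evaluateValidJson('{"status": "ok"}')  -> { score: 1, ... }
// evaluateValidJson('not json at all')   -> { score: 0, ... }
```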

API-based Evaluators

If you have built your own evaluation model for specific use cases, you can expose the model using an HTTP endpoint and integrate it within Maxim for evaluation.
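
As a rough sketch, such an endpoint might look like the Node.js server below. The request and response fields shown are illustrative, not a required schema; as noted in the grading section, you map whichever response field carries the score on Maxim's side.

```javascript
// Sketch of a custom evaluation service exposed over HTTP.
// Payload fields (input, output, score, reasoning) are illustrative.
const http = require("http");

http
  .createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      const { input, output } = JSON.parse(body);
      // Replace this placeholder with a call to your own evaluation model.
      const score = output && output.trim().length > 0 ? 0.9 : 0.0;
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ score, reasoning: "Scored by custom model." }));
    });
  })
  .listen(8080);
```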

Human Evaluators

This allows for the last mile of evaluation with human annotators in the loop. You can create a Human Evaluator for specific criteria that you want annotators to assess. During a test run, simply attach the evaluators, add details of the raters, and choose the sample set for human annotation. Learn more about the human evaluation lifecycle here.

Every evaluator should return a score and reasoning, which are then analyzed and used to summarize results according to your criteria.

Evaluator Grading

Every evaluator's grading configuration has two parts:

  1. Type of scale – Yes/No, Scale of 1-5, etc.

    • For AI evaluators, you choose the scale and provide an explanation of the grading logic.
    • For programmatic evaluators, the relevant response type can be configured.
    • For API-based evaluators, you can map the field to be used for scoring.
  2. Pass criteria – This includes configuration for two levels:

    • The score at which an evaluator should pass for a given query.
    • The percentage of queries that need to pass for the evaluator to pass at the run level across all dataset entries.

For custom evaluators, both of these are configurable; for pre-built evaluators, you can only define the pass criteria.

The evaluator below assigns a score between 0 and 1. The pass criteria are defined such that a query passes if it scores ≥ 0.8, and for the entire report, the evaluator "Clarity" passes if 80% of the queries score ≥ 0.8.
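
In code terms, that two-level pass logic boils down to something like this sketch, with the thresholds taken from the example above:

```javascript
// Two-level pass criteria: per-query threshold, then run-level percentage.
const QUERY_PASS_SCORE = 0.8; // a query passes if its score is >= 0.8
const RUN_PASS_PERCENT = 80;  // the run passes if >= 80% of queries pass

function evaluatorPassesRun(scores) {
  const passed = scores.filter((s) => s >= QUERY_PASS_SCORE).length;
  return (passed / scores.length) * 100 >= RUN_PASS_PERCENT;
}

// evaluatorPassesRun([0.9, 0.85, 0.7, 0.95, 0.88]) -> true (4 of 5 = 80%)
```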

Maxim uses reserved variables with specific meanings:

  • {{input}}: Input from the dataset
  • {{expectedOutput}}: Expected output from the dataset
  • {{expectedToolCalls}}: Expected tool calls from the dataset
  • {{scenario}}: Scenario from the dataset
  • {{expectedSteps}}: Expected steps from the dataset
  • {{output}}: Generated output of the workflow/prompt/prompt chain
  • {{context}}: Context to evaluate

Evaluator Reasoning

To help you analyze why certain cases perform well or underperform, we provide clear reasoning for each evaluator score. You can view this reasoning for each entry under the evaluation tab on its details sheet.

Multimodal Datasets

Datasets in Maxim are multimodal and can be created directly on the platform or uploaded from existing CSV files.

In Maxim, datasets can have columns of the following types (entities):

  • Input (Compulsory entity): A column associated with this type is treated as an input query to test your application.
  • Expected Output: Data representing the desired response that the application should generate for the corresponding input.
  • Output: For cases where you have already run your queries elsewhere and want to directly evaluate the outputs included in your CSV.
  • Image: You can upload images or provide an image URL.
  • Variables: Any data that you want to dynamically change in prompts/workflows during runtime.
  • Expected Tool Calls: Prompt tools expected to be triggered for the corresponding input.
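
Putting these columns together, a dataset uploaded as a CSV might look like the sample below. The header names and rows are purely illustrative; what matters is that each column corresponds to one of the entity types above.

```csv
input,expected_output,image,variables
"What is the refund window?","Refunds are accepted within 30 days.",https://example.com/receipt.png,"{""plan"": ""pro""}"
"Summarize this chart.","Sales grew 12% quarter over quarter.",https://example.com/chart.png,"{""plan"": ""free""}"
```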

Learn about creating multimodal datasets here.

Data Curation

We understand that having high-quality datasets from the start is challenging and that datasets need to evolve continuously. Maxim’s platform supports data curation at every stage of the lifecycle, so datasets can keep improving over time.

Learn more about data curation here.

Splits

Dataset splits allow you to create subsets of a larger dataset for different use cases or logical groupings. Creating a split helps manage datasets more easily and facilitates testing of specific prompts or workflows. You can create splits within a dataset, add elements to them, and view them side by side under different tabs. Learn more about dataset splits here.
