Evaluators

Concepts

Evaluator Store

A large set of pre-built evaluators is available for you to use directly. These can be found in the evaluator store and added to your workspace in a single click.

At Maxim, our pre-built evaluators fall into two categories:

  1. Maxim-created Evaluators: These are evaluators created, benchmarked, and managed by Maxim. There are 3 kinds of Maxim-created evaluators:

    1. AI Evaluators: As the name suggests, these are evaluators that use other large language models to evaluate your application (LLM-as-a-Judge).

    2. Statistical Evaluators: These are traditional ML metrics such as BLEU, ROUGE, WER, and TER.

    3. Programmatic Evaluators: These are JavaScript functions for common use cases, such as validJson and validURL, that help you validate your responses.

  2. Third-party Evaluators: We have also enabled popular third-party evaluation libraries, e.g., RAGAS, on the platform so you can use them in your evaluation workflows with just a few clicks. If you have a custom request for an integration, please feel free to drop us a note.

Within the store, you can search for an evaluator or filter by type. Simply click the 'Add to workspace' button to make it available for use by your team.

If you want us to build a specific evaluator for your needs, please drop us a line at [email protected].

Custom Evaluators

While we make many evaluators available out of the box for common use cases, we understand that there are sometimes application-specific requirements. With that in mind, the platform allows easy creation of custom evaluators of the following types:

AI Evaluators

As the name suggests, these are evaluators that use other LLMs to evaluate your application. You can configure different prompts, models, and scoring strategies depending on your use case. Once you have tested an evaluator in the playground, you can start using it in your workflows.
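As a rough illustration (not the platform's exact configuration schema; the field names and model below are hypothetical), an LLM-as-a-Judge evaluator typically combines an instruction prompt, a judge model, a scale, and the variables it should read:

```javascript
// Hypothetical sketch of an LLM-as-a-Judge evaluator configuration.
// Field names and the model are illustrative only; set the real values
// in the evaluator playground on the platform.
const clarityEvaluator = {
  name: "Clarity",
  model: "gpt-4o", // assumed judge model for illustration
  scale: "0 to 1", // type of scale, e.g. Yes/No or a numeric range
  prompt: [
    "You are grading the clarity of an assistant's answer.",
    "Question: {input}",
    "Answer: {output}",
    "Return a score between 0 and 1 and a short reasoning.",
  ].join("\n"),
};
```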

Programmatic Evaluators

These are JavaScript functions where you can write your own custom logic. You can use the {input}, {output}, and {expectedOutput} variables, which pull the relevant data from the corresponding dataset column or from the run's response when the evaluator executes.
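For example, a minimal programmatic evaluator along the lines of the built-in validJson check might look like the sketch below. The exact function signature and return shape expected by the platform are assumptions here; this assumes the evaluator receives the resolved {output} value and returns a score with reasoning.

```javascript
// Sketch of a programmatic evaluator that checks whether the model
// response is valid JSON. Assumes `output` holds the resolved {output}
// variable; the exact signature expected by the platform may differ.
function validJsonEvaluator(output) {
  try {
    JSON.parse(output);
    return { score: 1, reasoning: "Output parses as valid JSON." };
  } catch (err) {
    return { score: 0, reasoning: `Output is not valid JSON: ${err.message}` };
  }
}
```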

API-based Evaluators

If you have built your own evaluation model for specific use cases, you can easily expose the model using an HTTP endpoint and use that within Maxim for evaluation.
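As a sketch of what such an endpoint could look like (the request and response shapes are assumptions for illustration, not a prescribed contract), it accepts the input/output pair and returns a JSON body whose score field you then map in the evaluator configuration:

```javascript
// Minimal Node.js HTTP endpoint wrapping a custom evaluation model.
// The payload shape is an assumption; map the returned `score` field
// in the API-based evaluator configuration.
const http = require("http");

http
  .createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      const { input, output } = JSON.parse(body);
      // Replace this placeholder with a call to your own evaluation model.
      const score = output && output.length > 0 ? 0.9 : 0.0;
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ score, reasoning: "Scored by custom model." }));
    });
  })
  .listen(8080);
```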

Human Evaluators

This allows for the last mile of evaluation with human annotators in the loop. You can create a Human evaluator for the specific criteria that you want annotators to assess. During a test run, simply attach the evaluators, add details of the raters, and choose the sample set for human annotation. You can learn more about how the platform powers the entire human evaluation lifecycle here.

Every evaluator should return a score and reasoning, which are then analysed and used to summarise the results according to your criteria.

Evaluator Grading

Every evaluator's grading configuration has 2 parts:

  1. Type of scale - Yes/No, Scale of 1-5, etc.
    1. For AI evaluators, you can choose the scale and provide an explanation of the grading logic.
    2. For programmatic evaluators, the relevant response type can be configured.
    3. For API-based evaluators, you can map the field to be used for the score.
  2. Pass criteria - This includes configuration at 2 levels:
    1. The score at which the evaluator should pass for a given query.
    2. The percentage of queries that need to pass, across all dataset entries, for the evaluator to pass at the run level.

For custom evaluators, both of these are configurable, while for pre-built evaluators you can define your pass criteria.

The evaluator below gives a score between 0 and 1. The pass criteria have been defined such that a query passes if it scores 0.8 or higher, and for the entire report the evaluator 'Clarity' passes if 80% of the queries pass, i.e., score 0.8 or higher.
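A minimal sketch of how those two levels combine (the aggregation itself happens on the platform; the scores below are made up for illustration):

```javascript
// Illustration of the two-level pass criteria from the example above:
// a query passes if its score is >= 0.8, and the evaluator passes at
// the run level if at least 80% of queries pass.
const queryScores = [0.95, 0.82, 0.78, 0.9, 0.85]; // hypothetical scores

const queryThreshold = 0.8; // per-query pass criteria
const runThreshold = 0.8;   // fraction of queries that must pass

const passedCount = queryScores.filter((s) => s >= queryThreshold).length;
const runPasses = passedCount / queryScores.length >= runThreshold;

console.log(`Queries passed: ${passedCount}/${queryScores.length}`);
console.log(`Evaluator 'Clarity' passes at run level: ${runPasses}`);
```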

Evaluator Reasoning

To help you effectively analyse why certain cases are doing well or underperforming, we provide clear reasoning for each evaluator score. This can be viewed for each entry within the evaluation tab on its details sheet.
