Human-in-the-loop
Even with holistic auto evaluation, we understand that having humans in the loop is critical for high-quality evaluation. We have built the platform to integrate human annotation workflows alongside other forms of evaluation throughout the development lifecycle.
Add human evaluators to test runs using the following steps:
- Create human evaluators: add instructions, score type, and pass criteria.
- Select the relevant human evaluators while triggering a test run: switch them on while configuring the test run for a prompt or workflow.
- Set up human evaluation for the run: choose the method of annotation, add general instructions and rater emails if applicable, and configure the sampling rate.
- Collect ratings via test report columns or via email: based on the method chosen, annotators add their ratings on the run report or via the external dashboard link sent to their email.
- Review the summary of human ratings and deep dive into particular cases: as part of the test report, view the status of rater inputs and rating details, and add corrected outputs to a dataset.
Creating human evaluators
Custom human evaluators can be created with specific criteria for rating. You can add instructions that will be sent along with the evaluator so that human annotators or SMEs are aware of the grading logic. You can also define the evaluation score type and pass criteria, just as you would for any other evaluator.
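As a mental model of what such an evaluator captures, here is a minimal sketch. The class and field names below (HumanEvaluator, ScoreType, pass_threshold) are hypothetical illustrations, not Maxim's SDK or configuration schema; the actual setup happens in the Maxim UI.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical shapes for illustration only; the real configuration
# is done in the Maxim UI, not through this code.
class ScoreType(Enum):
    BINARY = "binary"  # yes/no style rating
    SCALE = "scale"    # e.g. a 1-5 rating scale

@dataclass
class HumanEvaluator:
    name: str              # e.g. "Tone of voice"
    instructions: str      # grading logic shared with annotators/SMEs
    score_type: ScoreType
    pass_threshold: float  # minimum score for an entry to count as a pass

tone_check = HumanEvaluator(
    name="Tone of voice",
    instructions="Rate 1-5 how well the response matches our brand tone.",
    score_type=ScoreType.SCALE,
    pass_threshold=4.0,
)
```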
Select human evaluators while triggering a test run
On the test run configuration panel for a workflow or prompt, you can switch on the relevant human evaluators from the list. When you click on ‘Trigger test run’, if any human evaluators were chosen, you will see a popover to set up the human evaluation.
Set up human evaluation for this run
The human evaluation setup requires the following choices:
- Method
  - Annotate on report: columns are added to the existing report so that all editors can add ratings.
  - Send via email: people within or outside your organization can submit ratings. The link sent is accessible separately and does not require a paid seat on your Maxim organization.
- If you choose to send evaluation requests via email, you need to provide the emails of the raters and the instructions to be sent.
- For email-based evaluation requests to SMEs or external annotators, you can send only the required entries by using a sampling rate. The sampling rate can be defined in two ways (see the sketch after this list):
  - Percentage of total entries: relevant for large datasets where it is not feasible to manually rate every entry.
  - Custom logic: send entries of a particular type to raters, e.g. entries with a low score on the Bias metric (auto eval). By defining such rules, you can make sure your SMEs spend their time on the most relevant cases.
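To make the two sampling modes concrete, here is a minimal sketch of the underlying idea, assuming a list of run entries with auto-eval scores attached. The data shape and function names are illustrative only; the actual sampling rules are configured in the Maxim UI.

```python
import random

# Illustrative only: assume 'entries' are test run results with
# auto-eval scores already attached.
entries = [
    {"id": 1, "scores": {"Bias": 0.9}},
    {"id": 2, "scores": {"Bias": 0.3}},
    {"id": 3, "scores": {"Bias": 0.2}},
]

def sample_by_percentage(entries, percentage):
    """Percentage of total entries: randomly pick a share of the dataset."""
    k = max(1, round(len(entries) * percentage / 100))
    return random.sample(entries, k)

def sample_by_rule(entries, metric, threshold):
    """Custom logic: only send entries that scored low on a given auto-eval metric."""
    return [e for e in entries if e["scores"].get(metric, 1.0) < threshold]

to_review = sample_by_rule(entries, metric="Bias", threshold=0.5)  # entries 2 and 3
```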
Collect ratings via test report columns
Human annotations can be added to the test report columns directly by any editor of the report. Clicking the 'select rating' button shows a popover with all the evaluators that need ratings. Comments can also be added for each rating. If the output is not up to the mark, the rater can also submit a re-written output.
If one rater has already provided ratings, a different rater can still add their inputs. Hovering over the row reveals an icon button near the previous value, and they can then add their ratings as described above. The average rating across raters is shown for that evaluator and used in the overall results calculation.
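As a rough illustration of that roll-up (a sketch of the idea, not Maxim's exact calculation), the evaluator's score for an entry is the mean of the individual ratings, which is then checked against the pass criteria:

```python
# Sketch: one evaluator's ratings from two raters on a single entry.
ratings = {"rater1@example.com": 4, "rater2@example.com": 5}

average = sum(ratings.values()) / len(ratings)  # 4.5
passed = average >= 4.0                         # pass criteria set on the evaluator
print(f"average={average}, passed={passed}")
```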
Collect ratings via email
On completion of the test run, emails are sent to all raters provided during setup. The email contains the requester's name and instructions, along with a link to the rater dashboard.
The dashboard is accessible externally without a paid seat on Maxim, so you can send it to external annotation teams or SMEs who are helping with annotation. As soon as a rater starts evaluating via the dashboard, the status of their evaluation changes from 'pending' to 'in-progress' on the test run summary.
Human raters can go through the query, retrieved context, output, and expected output (if applicable) for each entry, and then provide their rating for each evaluation metric. They can also add comments or re-write the output for a particular entry. Once the ratings for an entry are complete, they can save and proceed, and these values start reflecting on the Maxim test run report. They repeat this process for all entries; once everything is done, the status changes on the run report and the details become available.
Analyse human ratings
Summary scores and pass/fail results for the human ratings are shown alongside all other auto evaluation results in the test run report. To view the detailed ratings by a particular individual, click the view details button next to their email ID and go through the table provided. If there are particular cases where you would like to use the human-corrected output to build ground truth in your datasets, you can do this via the data curation workflow. More details on that are mentioned here.