
Set up human evaluation on logs

Use human evaluation and ratings to assess the quality of your logs.

The need for human evaluation

While machine learning models can provide a baseline evaluation, they may not always capture the nuances of human perception, simply because they can miss the context and emotions behind some scenarios. In these cases, humans can also provide better comments and insights, which makes it essential to include humans in the evaluation process.

Human evaluation on logs is very similar to human annotation on test runs; in fact, the Human Evaluators used in test runs are also used here. Let's see how to set up a human evaluation pipeline for our logs.

Before you start

You need to have logging set up to capture the interactions between your LLM and your users before you can evaluate them. To do so, integrate the Maxim SDK into your application.
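As a rough sketch, instrumenting your application with the Python SDK generally means creating a logger bound to a log repository and recording each interaction as a trace. The class and method names below reflect one version of the SDK and may differ in yours, and the API key and repository ID are placeholders; treat this as an illustrative outline and follow the SDK integration guide for the authoritative setup.

```python
import os
from uuid import uuid4

# Illustrative sketch only: exact imports and signatures may vary by SDK version.
from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig

# Initialize the SDK with your API key (placeholder read from an environment variable).
maxim = Maxim(Config(api_key=os.environ["MAXIM_API_KEY"]))

# Create a logger bound to the log repository you plan to evaluate.
logger = maxim.logger(LoggerConfig(id="<your-log-repository-id>"))

# Record one user <-> LLM interaction as a trace.
trace = logger.trace(TraceConfig(id=str(uuid4()), name="user-query"))
trace.set_input("How do I reset my password?")
trace.set_output("You can reset your password from the account settings page.")
trace.end()

# Flush buffered logs before the process exits.
logger.flush()
```

Once traces like these start landing in the repository, the steps below can route the matching ones into an annotation queue.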

Also, if you do not have a Human Evaluator in your workspace, create one by navigating to the Evaluators tab from the sidebar, as we will need it to set up the human evaluation pipeline.

Setting up human evaluation

Navigate to the repository where you want to set up human evaluation on logs.

Click on Configure evaluation in the top right corner of the page and choose the Setup evaluation configuration option. This will open up the evaluation configuration sheet.


Focus on the Human Evaluation section of the sheet. Under Select evaluators, use the dropdown to choose the Human Evaluators you want to use for this evaluation.

This defines what evaluation we want to run on our logs. Next, we need to set up filtering criteria to determine which logs should be evaluated, since manually evaluating every log quickly becomes unmanageable.

Screenshot of the evaluation configuration sheet

The Auto evaluation section of this sheet is covered separately; refer to the Auto Evaluation guide to learn more about using other types of evaluators to evaluate your logs.

Before setting up the filtering criteria, though, we need to save this configuration. Do this by clicking the Save configuration button.

To set up the filtering criteria, click Configure evaluation in the top right corner of the page again, but this time choose the View annotation queue option. You will be taken to the annotation queue page.

Screenshot of the annotation queue page

Here you will see a Set up queue logic button. Click it to define the logic for the queue, then click the Save queue logic button to save.

Screenshot of queue logic setup form

Human evaluation is now set up: logs that match the criteria you defined are automatically added to the queue, where they can be annotated and thereby evaluated! ✍🏻

You can also manually add logs to the queue by:

  • Selecting the logs you want to add to the queue by clicking on the checkboxes at the left of each log
  • Clicking on the Add to annotation queue button and you're done!

Screenshot of the add to annotation queue button

Viewing annotations

There are 3 places where annotations can be viewed:

The annotation queue page

Here, each added log displays the scores for its Human Evaluators. Each score is the average of all annotations made for that evaluator by different users; for example, if three users rate a log 3, 4, and 5 on an evaluator, the displayed score is 4. When you edit a score, your own individual score, comment, and rewritten output (if any) are shown, and you can edit all of them.

Screenshot of the annotation queue page with scores

Annotating the logs

On opening the annotation queue page, you will see a list of logs that have been added to the queue, each with a Select rating dropdown beside it.

Clicking on the Select rating dropdown will open a modal where you can select a rating for the log and optionally add a comment or provide a rewritten output if necessary.

Screenshot of the annotation modal

You can also click on an entry to open the annotation sheet, where you can see the complete input and output and rate the entry against all the evaluators at once.

After scoring an entry, click on the Save and next button to move to the next log/entry and score it.

Screenshot of the annotation sheet view

The logs table

Similar to how evaluator scores are shown for auto evaluation, Human Evaluator scores are also shown here (again, as the average score).

Screenshot of the logs table with human evaluator scores

The trace details sheet (under the Evaluation tab)

On opening any trace, you will find a Details tab and an Evaluation tab. The Evaluation tab displays all the evaluations that happened on that trace. We will focus on Human Evaluators here; to make sense of the other evaluators in this sheet, refer to Auto Evaluation -> Making sense of evaluations on logs.

The trace evaluation overview tab shows the average score for each Human Evaluator, along with Rewritten Outputs from each individual user, if present.

Screenshot of the trace evaluation overview tab with rewritten outputs

Drilling into an individual Human Evaluator, we see its Score (avg.) and Result (whether that evaluator's evaluation passed or failed). This tab also shows a breakdown of the scores given by each user, along with their comments, if any, giving you a granular view of the evaluation.

Screenshot of an individual human evaluator's evaluation with per user score breakdown
