While machine learning models can provide a baseline evaluation, they may not capture the nuances of human perception, since they often miss the context and emotion behind a scenario. Humans can also offer richer comments and insights in these cases. This makes it essential to include humans in the evaluation process.
Human evaluation on logs works very much like human annotation on test runs; in fact, the same Human Evaluators used in test runs are used here. Let's see how we can set up a human evaluation pipeline for our logs.
Before you start
Before you can evaluate logs, you need logging set up to capture the interactions between your LLM and users. To do so, integrate the Maxim SDK into your application.
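If your application is not instrumented yet, the general shape of SDK-based logging looks roughly like the sketch below (shown in Python). Treat it as a sketch under assumptions: the exact package, class, and method names (Maxim, Config, LoggerConfig, TraceConfig, set_input, set_output, end) vary by SDK version and language, so confirm them against the Maxim SDK documentation before using them.

```python
# Illustrative sketch only: the class and method names below are assumptions
# based on the Maxim SDK's general shape; check the SDK docs for your language
# and version for the exact API.
from uuid import uuid4

from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig

# Initialise the SDK and a logger bound to your log repository.
maxim = Maxim(Config(api_key="YOUR_MAXIM_API_KEY"))
logger = maxim.logger(LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))

# Capture one LLM <-> user interaction as a trace.
trace = logger.trace(TraceConfig(id=str(uuid4()), name="user-chat"))
trace.set_input("What is your refund policy?")                 # assumed setter names
trace.set_output("You can request a refund within 30 days.")
trace.end()  # mark the trace as complete so it shows up in the repository
```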
If you do not yet have a Human Evaluator in your workspace, create one by navigating to the Evaluators tab from the sidebar; we will need it to set up the human evaluation pipeline.
Navigate to the repository where you want to set up human evaluation on logs.
Click on Configure evaluation in the top right corner of the page and choose the Setup evaluation configuration option. This will open up the evaluation configuration sheet.
Setup evaluation configuration
Create annotation queue
Focus on the Human Evaluation section of the sheet. Under Select evaluators, use the dropdown to choose the Human Evaluators you want to use for this evaluation.
This defines what evaluation we want to run on our logs. Next, we need to set up filtering criteria to determine which logs should be evaluated, since evaluating every log by hand quickly becomes unmanageable.
We talked about the Auto evaluation section above. You can learn more about using other types of evaluators to evaluate your logs there.
Before we set up the filtering criteria, though, we need to save this configuration. Do this by clicking on the Save configuration button.
To get to the filtering criteria, click on Configure evaluation in the top right corner of the page again, but this time choose the View annotation queue option. You will be taken to the annotation queue page.
Here you will see a Set up queue logic button. Click on it to define the logic for the queue, and finally click on the Save queue logic button to save it.
Human evaluation is now set up: logs matching the criteria you defined will automatically keep getting added to the queue, where they can be annotated and thus evaluated! ✍🏻
You can also manually add logs to the queue by:
Selecting the logs you want to add to the queue by clicking on the checkboxes at the left of each log
Clicking on the Add to annotation queue button and you're done!
Each added log will have its Human Evaluators' scores displayed. A score is the average of all the annotations given for that evaluator by different users. On editing a score, the individual score of the user editing it, along with their comment and rewritten output (if any), is shown, with the ability to edit all of them.
On opening the annotation queue page, you will see a list of logs that have been added to the queue, each with a Select rating dropdown beside it.
Clicking on the Select rating dropdown will open a modal where you can select a rating for the log and optionally add a comment or provide a rewritten output if necessary.
You can also click on an entry to open the annotation sheet, where you can see the complete input and output and rate the entry for all the evaluators at once.
After scoring an entry, click on the Save and next button to move to the next log/entry and score it.
On opening any trace, you will find Details and Evaluation tabs. The Evaluation tab displays all the evaluations that happened on the trace. We will focus on the Human Evaluators here; to make sense of the other evaluators in this sheet, refer to Auto Evaluation -> Making sense of evaluations on logs.
The trace evaluation overview tab shows the average score of each Human Evaluator, along with the Rewritten Outputs, if present, from each individual user.
Drilling into an individual Human Evaluator, we see its Score (avg.) and Result (whether that evaluator's evaluation passed or failed). This tab also shows a breakdown of the scores and their corresponding comments, if any, given by each user, giving you a granular view of the evaluation.
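As a concrete illustration of how these values relate, here is a minimal sketch in Python. The averaging mirrors the description above, while the pass threshold is purely an assumed example; the real Result follows the pass criteria configured on the Human Evaluator itself.

```python
# Purely illustrative sketch of how the displayed values relate; not Maxim's
# actual implementation. The pass criterion (average >= 3.5) is an assumption;
# the real Result follows the pass criteria configured on the Human Evaluator.
user_scores = {"alice": 4, "bob": 5, "carol": 3}  # ratings from three annotators

score_avg = sum(user_scores.values()) / len(user_scores)  # Score (avg.) -> 4.0
result = "Passed" if score_avg >= 3.5 else "Failed"       # Result -> "Passed"

print(score_avg, result)
```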