Evaluate captured logs automatically from the UI based on filters and sampling
Evaluation is a necessary step when building an LLM application, but because LLMs are non-deterministic, test runs can never cover every possible scenario; evaluating the LLM on the live system therefore becomes crucial as well.
Evaluating logs helps cover cases and scenarios that test runs might miss, ensuring that the LLM performs well under varied conditions. It also surfaces potential issues early, so you can make the necessary adjustments and improve the LLM's overall performance in time.
Before you start
You need to have logging set up to capture interactions between your LLM and users before you can evaluate them. To do so, integrate the Maxim SDK into your application.
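As a rough illustration of what that integration looks like, here is a minimal sketch using the Maxim Python SDK (maxim-py). The API key, log repository ID, model, and messages are placeholders, and the config-object style shown follows the SDK's documented pattern but may differ between SDK versions, so treat this as a sketch and check the Maxim SDK reference for the exact API.

```python
# Minimal sketch: capturing one user/LLM interaction as a trace with maxim-py.
# Class and method names follow the SDK's documented config-object style but
# may vary by version -- verify against the Maxim SDK reference.
from uuid import uuid4

from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig, GenerationConfig

maxim = Maxim(Config(api_key="YOUR_MAXIM_API_KEY"))               # placeholder key
logger = maxim.logger(LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))  # placeholder repo ID

# One trace per user interaction.
trace = logger.trace(TraceConfig(id=str(uuid4()), name="user-query"))

# Log the LLM call as a generation inside the trace.
generation = trace.generation(GenerationConfig(
    id=str(uuid4()),
    name="answer-generation",
    provider="openai",          # assumption: an OpenAI chat model is being used
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
))

llm_response = {  # normally the raw provider response; shown here as a stub
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Items can be returned within 30 days."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 12, "completion_tokens": 10, "total_tokens": 22},
}
generation.result(llm_response)
trace.end()
```

Once traces like this are flowing into your log repository, the evaluation configuration below applies to them automatically.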
Navigate to the repository where you want to evaluate your logs.
Click Configure evaluation in the top right corner of the page and choose the Setup evaluation configuration option. This will open up the evaluation configuration sheet.
The sheet's Auto Evaluation section has 3 parts:
Select evaluators: Choose the evaluators you want to use for your evaluation.
Filters: Set up filters so that only logs meeting certain criteria are evaluated.
Sampling: Choose a sampling rate. This controls how many logs are evaluated and prevents evaluating every single log, which could lead to very high costs (see the conceptual sketch after these steps).
The Human Evaluation section below is explained in the Set up human evaluation on logs section.
Finally, click the Save configuration button.
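To build intuition for how filters and sampling keep evaluation costs in check, here is a small, purely illustrative Python sketch. The should_evaluate function, the log fields it reads, and the 10% rate are all hypothetical and do not reflect how Maxim applies your configuration internally; they only show the idea of filtering first and then sampling the remainder.

```python
import random

def should_evaluate(log: dict, sampling_rate: float = 0.10) -> bool:
    """Illustrative only: filters narrow the candidate logs, then the sampling
    rate decides what fraction of the remaining logs actually gets evaluated."""
    # Hypothetical filter: only evaluate traffic tagged as production.
    if log.get("tags", {}).get("environment") != "production":
        return False
    # Sampling: evaluate roughly `sampling_rate` (here 10%) of the filtered logs.
    return random.random() < sampling_rate

# Example: out of 1,000 filtered logs, roughly 100 would be picked for evaluation.
logs = [{"tags": {"environment": "production"}} for _ in range(1000)]
picked = sum(should_evaluate(log) for log in logs)
print(f"{picked} of {len(logs)} logs selected for evaluation")
```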
The configuration is now done, and your logs should start getting evaluated automatically based on the filters and sampling rate you have set up! 🎉
In the logs' table view, you can find the evaluations on a trace in its row towards the left end, where the evaluation scores are displayed. You can also sort the logs by evaluation score by clicking any evaluator's column header.
Click the trace to view detailed evaluation results. In the sheet, you will find the Evaluation tab, where you can see the evaluation in detail.
The Evaluation tab displays many details about the trace's evaluation. Let's see how you can navigate through them and get more insight into how your LLM is performing.
Evaluation summary displays the following information (top to bottom, left to right):
In each card, you will find a tab switcher in the top right corner, which is used to navigate through the evaluation's details. Here is what you can find in the different tabs:
All the evaluators run at the trace level are listed in a table here, along with their scores and whether each evaluator passed or failed.
This tab contains the following sections:
This view is essential when you are evaluating each log at the node level, that is, on each component of the trace (such as a generation or retrieval). It helps you see which component's evaluation you are looking at in the right panel, as well as that component's place in the trace. We discuss Node Level Evaluation in more detail further down.
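Node-level evaluation works on the individual components you log inside a trace, so each logged component becomes a node you can inspect here. Continuing the assumed maxim-py style from the earlier sketch, a RAG-style trace might log a retrieval and a generation as separate components; RetrievalConfig and the retrieval input/output methods shown below are assumptions to verify against the SDK reference, and the repository ID, query, and chunks are placeholders.

```python
# Hedged sketch: logging individual components (nodes) inside a trace so that
# node-level evaluation has nodes to attach to. Names such as RetrievalConfig
# and retrieval.input/.output are assumptions based on maxim-py's documented
# style -- check your SDK version.
from uuid import uuid4

from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig, RetrievalConfig, GenerationConfig

logger = Maxim(Config(api_key="YOUR_MAXIM_API_KEY")).logger(
    LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))  # placeholders

trace = logger.trace(TraceConfig(id=str(uuid4()), name="rag-query"))

# Node 1: the retrieval step.
retrieval = trace.retrieval(RetrievalConfig(id=str(uuid4()), name="kb-retrieval"))
retrieval.input("return policy")                         # the search query
retrieval.output(["Returns are accepted within 30 days."])  # retrieved chunks
retrieval.end()

# Node 2: the generation step (see the earlier sketch for attaching the result).
generation = trace.generation(GenerationConfig(
    id=str(uuid4()), name="answer-generation",
    provider="openai", model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
))
# ... call generation.result(<raw provider response>) once the model replies ...
trace.end()
```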