
Why Evaluate Logs?

We know that evaluation is a necessary step when building an LLM application, but because LLMs are non-deterministic, test runs can never cover every possible scenario; evaluating the LLM on the live system therefore becomes crucial. Evaluating logs helps cover cases or scenarios that test runs miss, ensuring that the LLM performs well under varied, real-world conditions. It also lets you identify potential issues early, so you can make the necessary adjustments to improve the LLM's overall performance in time. With Maxim's multi-level evaluation system, you can evaluate at different granularities, from entire conversations (sessions) to individual responses (traces) to specific components (spans), giving you comprehensive visibility into your AI application's performance.
Diagram of the evaluation iteration loop
Before you start: You need to have logging set up to capture the interactions between your LLM and users before you can evaluate them. To do so, integrate the Maxim SDK into your application.
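If you haven't instrumented your application yet, the setup looks roughly like the sketch below. This is a minimal sketch assuming the Python SDK (maxim-py); the class and method names used here (Config, LoggerConfig, TraceConfig, set_input, set_output, flush) are assumptions that may differ across SDK versions, so refer to the SDK integration guide for the exact API.

```python
# Minimal logging sketch. The names below (Config, LoggerConfig, TraceConfig,
# set_input, set_output, flush) are assumptions about the maxim-py API; verify
# them against the SDK integration docs for your version.
from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig

# Authenticate and point the logger at the log repository you want to evaluate
maxim = Maxim(Config(api_key="YOUR_MAXIM_API_KEY"))
logger = maxim.logger(LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))

# Log one user interaction as a trace: the user's input and your LLM's output
trace = logger.trace(TraceConfig(id="trace-1", name="user-question"))
trace.set_input("How do I reset my password?")
trace.set_output("You can reset it from Settings > Security > Reset password.")
trace.end()

# Flush before the process exits so the logged data reaches Maxim
logger.flush()
```

Once traces like this are flowing into your repository, the evaluation configuration described below applies to them automatically.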

Understanding Evaluation Levels

Maxim supports evaluating your AI application at three different levels of granularity. This multi-level approach allows you to assess quality at different scopes depending on your use case:

Session-Level Evaluation

Sessions represent multi-turn interactions or conversations. Session-level evaluators assess the quality of an entire conversation flow. Use session-level evaluation when:
  • You want to measure conversation quality across multiple turns
  • You need to evaluate multi-turn coherence, context retention, or conversation flow
  • You’re assessing overall user satisfaction or goal completion
  • Your evaluator needs access to the full conversation history

Trace-Level Evaluation

Traces represent single interactions or responses. Trace-level evaluators assess individual completions or responses. Use trace-level evaluation when:
  • You want to measure the quality of individual responses
  • You need to evaluate single-turn metrics like helpfulness or accuracy
  • You’re assessing response-specific attributes like tone or formatting

Span-Level Evaluation

Spans represent specific components within a trace (e.g., a generation, retrieval, tool call, or custom component). Span-level evaluators assess individual components in isolation. Use span-level evaluation when:
  • You want to evaluate specific components of your agentic workflow
  • You need to assess retrieval quality, individual generation steps, or tool usage
  • You’re optimizing specific parts of your application independently
  • You need component-specific metrics for debugging or optimization
Span-level evaluations are configured programmatically via the SDK rather than through the UI. See Node-Level Evaluation for implementation details.
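To give a sense of what that looks like in code, here is a sketch that continues the trace from the logging example in the Before you start note above. The SpanConfig class and the evaluate()/with_evaluators()/with_variables() chain are assumptions about the SDK's API, not confirmed signatures; the Node-Level Evaluation guide documents the actual calls.

```python
# Hypothetical span-level evaluation sketch, continuing the trace from the
# logging example above. The SpanConfig class and the evaluate()/
# with_evaluators()/with_variables() chain are assumed names; the
# Node-Level Evaluation guide documents the real API.
from maxim.logger import SpanConfig  # assumed import

user_query = "How do I reset my password?"
retrieved_chunks = ["Passwords can be reset from Settings > Security."]

# Wrap the retrieval step of an agentic workflow in a span on the existing trace
span = trace.span(SpanConfig(id="retrieval-1", name="vector-search"))

# Attach an evaluator to this span and map the variables it needs
span.evaluate() \
    .with_evaluators("Context Relevance") \
    .with_variables({
        "input": user_query,           # fills the evaluator's {{input}} variable
        "context": retrieved_chunks,   # fills {{context}}
    })

span.end()
```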

Setting Up Auto Evaluation

1

Navigate to repository

Navigate to the repository where you want to evaluate your logs.
2

Access evaluation configuration

Click on Configure evaluation in the top right corner of the page. This will open the evaluation configuration sheet.
Screenshot of the repository page with the configure evaluation button highlighted
3

Understand evaluation levels

The Auto Evaluation section allows you to configure evaluators at different levels of granularity:
  • Session: Evaluate multi-turn interactions (conversations) as a whole. Use this when you need to assess the quality of an entire conversation or dialogue flow.
  • Trace: Evaluate a single response to a user. Use this for evaluating individual interactions or single completions.
  • Span: Evaluate specific components within a trace (e.g., a particular generation, retrieval, or tool call). This is configured via the SDK - see Node-Level Evaluation for details.
Choosing the right level:
  • Use Session level for evaluating conversation quality, multi-turn coherence, or overall user satisfaction
  • Use Trace level for single-turn quality metrics like helpfulness, accuracy, or tone
  • Use Span level (via SDK) for component-specific metrics like retrieval quality or generation clarity
4

Add evaluators at each level

For each level (Session and Trace), click Add evaluators to select the evaluators you want to run.
Once you select an evaluator, you'll need to map variables to the evaluator's required inputs. For example:
  • {{input}} might map to the user's input
  • {{output}} might map to trace[*].output for session-level or trace.output for trace-level
  • {{context}} might map to retrieved context like retrieval[*].retrievedChunks[*]
Screenshot of variable mapping
Variable mapping syntax:
  • Use trace.output to reference a trace’s output
  • Use trace[*].output to reference all outputs in a session
  • Use retrieval[*].retrievedChunks[*] to reference retrieved context from retrieval spans
  • Custom mappings can be created by clicking on the mapping field (a short example follows below)
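For example, if you attach the same answer-quality evaluator at both the trace and session levels, the {{output}} mapping would typically differ as follows (illustrative; your evaluator's variable names may vary):
  • Trace level: {{output}} → trace.output (the single response being judged)
  • Session level: {{output}} → trace[*].output (all responses across the conversation)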
5

Configure filters and sampling

Below the evaluator configuration, you'll find two additional controls:
Filters: Set up filters to evaluate only the logs that meet certain criteria. Click Add filter rule to create conditions based on various log properties:
  • Trace ID / Session ID: Filter by specific trace or session identifiers
  • Input / Output: Filter based on user input or model output content
  • Error: Filter logs that have errors or specific error types
  • Model: Filter by the LLM model used (e.g., gpt-4, claude-3, etc.)
  • Tags: Filter by custom tags you’ve added to your traces
  • Metrics: Filter based on evaluation scores or other metrics
  • Cost: Filter by cost thresholds (e.g., only evaluate expensive requests)
  • Tokens: Filter by token usage (e.g., evaluate long conversations)
  • User Feedback: Filter by user ratings or feedback scores
  • Latency: Filter by response time (e.g., evaluate slow requests)
You can combine multiple filter conditions using "All" (AND) or "Any" (OR) logic to create sophisticated filtering rules.
Sampling: Choose a sampling rate to control what percentage of logs are evaluated. This helps manage costs by preventing evaluation of every single log.
Screenshot of filters and sampling
You can also enable Rate Limiting to cap the maximum number of logs sampled per time period, providing additional cost control.
The Human Evaluation section is explained in the Set up human evaluation on logs section.
6

Save configuration

Finally, click the Save configuration button.
The configuration is now done, and your logs should start being evaluated automatically based on the filters and sampling rate you have set up! 🎉

Making Sense of Evaluations on Logs

In the logs table view, you can find a trace's evaluations towards the left end of its row, displayed as evaluation scores. You can also sort the logs by evaluation score by clicking any evaluator's column header.
Screenshot of the logs table with traces having evaluations
Click a trace to view detailed evaluation results. In the details sheet, open the Evaluation tab to see the evaluation in detail.
Screenshot of the details sheet with the evaluation tab highlighted
The Evaluation tab displays many details about the trace's evaluation; let's walk through them to see how you can get more insight into how your LLM is performing.

Evaluation summary

Evaluation summary displays the following information (top to bottom, left to right):
  • How many evaluators passed out of the total evaluators run across the trace
  • The total cost of all the evaluators' evaluations
  • The total number of tokens used across all the evaluators' evaluations
  • The total time taken for the evaluation to process

Evaluation cards by level

Depending on what levels you configured evaluators for, you’ll see separate evaluation cards:
  • Session evaluation card: Shows evaluators that ran on the entire session (multi-turn conversation)
  • Trace evaluation card: Shows evaluators that ran on the individual trace (single interaction)
  • Span evaluation cards: Shows evaluators that ran on specific components within the trace (configured via SDK)
In each card, you will find a tab switcher in the top right corner, which is used to navigate through the evaluation's details. Here is what you can find in the different tabs:

Overview tab

Screenshot of the overview tab in the trace evaluation card
All the evaluators run at the trace level are listed in a table here, along with their scores and whether each evaluator passed or failed.

Individual evaluator’s tab

Screenshot of the individual evaluator's tab in the trace evaluation card
This tab contains the following sections:
  • Result: Shows whether the evaluator passed or failed.
  • Score: Shows the score of the evaluator.
  • Reason (shown where applicable): Displays the reasoning behind the score of the evaluator, if given.
  • Cost (shown where applicable): Shows the cost of the individual evaluator’s evaluation.
  • Tokens used (shown where applicable): Shows the number of tokens used by the individual evaluator’s evaluation.
  • Model latency (shown where applicable): Shows the time taken by the model to respond back with a result for an evaluator.
  • Time taken: Shows the time taken by the evaluator to evaluate.
  • Variables used to evaluate: Shows the values that were substituted for the variables while running the evaluator.
  • Logs: These are logs that were generated during the evaluation process. They might be useful for debugging errors or issues that occurred during the evaluation.

Tree view on the left panel

Screenshot of the tree view on the left panel
When you click on any node in the tree view, the right panel displays the evaluation results specific to that level. This helps you understand which component's evaluation you're viewing and its place in the overall interaction hierarchy. Learn more about evaluating individual components in Node-Level Evaluation.

Dataset Curation

Once you have logs and evaluations in Maxim, you can easily curate datasets by filtering and selecting logs based on different criteria.
1

Filter logs with specific evaluation scores (e.g., bias score greater than 0)

Filter
2

Select all filtered logs using the top-left selector

Filtered
3

Click the `Add to dataset` button that appears

Add to dataset
4

Choose to add logs to an existing dataset or create a new dataset. Map the columns and click `Add entries`

Add to dataset dialog