Runs

Parts of a run report

Test summary

The left section of the report gives you an overall summary of the run: its status, what was run, time taken, cost, and so on. This helps you identify the details of a run when you come back to it later. You can open the prompt/workflow and dataset used directly from the links in this section.

Test run summary

Evaluation summary

The evaluation summary card gives you an overall picture of the results for the evaluators you have chosen.

  1. Every evaluator has its own column with its pass/fail result, mean score, and pass rate (aggregated across all queries). The result depends on the evaluator score configurations you set up earlier. If you have chosen many evaluators, scroll horizontally to see all of their results (a sketch of this aggregation follows the list).
  2. If it's a comparison run, you will see grouped rows for each of the compared entities, e.g. Prompt v1 vs. Prompt v2. This way you can quickly see how the overall scores compare across the prompts, versions, or workflows you chose.
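To make the aggregation concrete, here is a minimal sketch of how a per-evaluator mean score and pass rate could be computed across run entries. The data structure, evaluator names, and scores are illustrative assumptions, not the platform's actual implementation:

```python
# Minimal sketch of per-evaluator aggregation across run entries.
# Assumes each entry maps evaluator name -> (score, passed); values are made up.

from statistics import mean

entries = [
    {"bias": (0.9, True), "clarity": (0.7, True)},
    {"bias": (0.4, False), "clarity": (0.8, True)},
    {"bias": (0.8, True), "clarity": (0.5, False)},
]

for name in entries[0]:
    scores = [entry[name][0] for entry in entries]
    passes = [entry[name][1] for entry in entries]
    print(
        f"{name}: mean score = {mean(scores):.2f}, "
        f"pass rate = {100 * sum(passes) / len(passes):.0f}%"
    )
```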

Test run summary

Performance metrics charts

For every run, it's important to have data about latency, cost, and token usage. The charts shown within the report give you a sense of this information across entities.

  1. Cost (in $) - Total, input, and completion. Multiple bars are shown for comparison runs. If you have custom pricing for certain models, set it up in settings for it to be reflected here.
  2. Tokens used - Total, input, and completion. Multiple bars are shown for comparison runs.
  3. Latency - p50, p90, p95, and p99, as well as min, max, mean, and standard deviation (a sketch of how these aggregates can be computed is shown below).

You can hover over any data point to see the exact numbers.
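As an illustration of the numbers behind these charts, the sketch below derives the latency percentiles and the cost split from raw per-entry data. The latencies and per-token prices are made-up example values, not actual platform pricing:

```python
# Minimal sketch of the latency and cost aggregates shown in the charts.
# All input values below are illustrative.

from statistics import mean, quantiles, stdev

latencies = [0.42, 0.55, 0.61, 0.48, 1.20, 0.50, 0.95, 0.58, 0.47, 0.70]  # seconds

# quantiles(..., n=100) returns 99 cut points; cuts[k - 1] approximates the k-th percentile.
cuts = quantiles(latencies, n=100)
print(f"p50={cuts[49]:.2f}s p90={cuts[89]:.2f}s p95={cuts[94]:.2f}s p99={cuts[98]:.2f}s")
print(f"min={min(latencies):.2f}s max={max(latencies):.2f}s "
      f"mean={mean(latencies):.2f}s stdev={stdev(latencies):.2f}s")

# Cost = input tokens * input price + completion tokens * completion price.
input_tokens, completion_tokens = 12_450, 3_180
input_price, completion_price = 0.50 / 1_000_000, 1.50 / 1_000_000  # $ per token (example rates)
input_cost = input_tokens * input_price
completion_cost = completion_tokens * completion_price
print(f"input=${input_cost:.4f} completion=${completion_cost:.4f} "
      f"total=${input_cost + completion_cost:.4f}")
```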

Test run details

Detailed run table

This table at the end of the report shows the details of every entry and its result. In the case of a comparison run, the default view shows the compared results of both entities for every entry. To view only one entity, switch the tab at the top of the table.

This table is fully customisable to suit your requirements. Learn more about how to customise it here.

The table consists of the following data:

  1. Status - This can be queued, running, completed, failed, or stopped.
  2. Input + Dataset columns (optional) - Input and expected output are shown by default. Other columns can be toggled on if you need to view them.
  3. Run outputs - This includes the retrieved context and the final output from the LLM.
  4. Latency information.
  5. Evaluator scores - Each evaluator's score is shown in its own column. Scores from any human evaluators you added are also visible here.

Test run details

Test entry run details

To understand exactly what happened during the run of each entry, view its details by clicking on its row. The sheet that opens gives you insights into the exact output, performance, and evaluation details, organised under the following tabs:

  1. Overview - A deep dive into the input, expected output, LLM output, tokens, and cost. You can switch the view from Markdown to plain text via the toggle shown on hover.
  2. Messages - The exact messages that were sent to the LLM. You can work with these further by clicking ‘Open in playground’.
  3. Evaluations - The reasoning behind each evaluation score.
  4. Logs - Helps you debug cases where something may have failed.
  5. Stats - A step-by-step breakdown of metrics such as cost and tokens.

Test run details
