How can I Curate Datasets From My Production Logs?

Curating from Production Logs

The production log curation workflow in Maxim follows these steps:

Select relevant logs: Navigate to your log repository (preferably production) and use filters to identify high-quality examples, edge cases, or specific scenarios you want to preserve for testing
Initiate dataset creation: Select the logs you want to curate and click the “Add to Dataset” button in the top right corner
Choose or create dataset: Either add to an existing dataset or create a new one using Maxim’s pre-built templates (like “Dataset testing”) or custom column structures
Map log fields to dataset columns: Configure how log data maps to your dataset structure (e.g., Input field to Input column, Output to Output column, custom fields to reference data columns)
Finalize and access: Click “Add to Dataset” and receive a notification when processing is complete
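
The steps above happen in the Maxim UI, but the selection and field-mapping logic can be sketched in plain Python to make the idea concrete. This is a minimal illustration and does not use Maxim's SDK or API: the log field names (input, output, tags, feedback), the column names, and the helpers select_logs, map_to_row and add_to_dataset are all hypothetical.

```python
from typing import Any

# Hypothetical exported log records. The field names are illustrative
# assumptions, not Maxim's actual log schema.
LOGS: list[dict[str, Any]] = [
    {"input": "How do I reset my password?",
     "output": "Use the reset link on the sign-in page.",
     "tags": ["account"], "feedback": 1},
    {"input": "Cancel my order", "output": "",
     "tags": ["orders", "edge-case"], "feedback": -1},
    {"input": "hi", "output": "Hello! How can I help?",
     "tags": ["smalltalk"], "feedback": 0},
]

# Mapping step: log field -> dataset column (column names are assumptions).
FIELD_TO_COLUMN = {"input": "Input", "output": "Output", "tags": "Reference Data"}


def select_logs(logs, keep):
    """Selection step: keep only the logs that match a filter."""
    return [log for log in logs if keep(log)]


def map_to_row(log, mapping):
    """Mapping step: copy each mapped log field into its dataset column."""
    return {column: log.get(field) for field, column in mapping.items()}


def add_to_dataset(dataset, logs, mapping):
    """Final step: append the mapped rows to an in-memory dataset."""
    dataset.extend(map_to_row(log, mapping) for log in logs)
    return dataset


# Keep positively rated answers and anything tagged as an edge case.
selected = select_logs(
    LOGS, lambda log: log["feedback"] > 0 or "edge-case" in log["tags"]
)
dataset = add_to_dataset([], selected, FIELD_TO_COLUMN)
print(len(dataset), "rows curated")
```

Unmapped fields (here, feedback) are used only for filtering and are left out of the dataset rows, which mirrors the idea of mapping only the log fields you need.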

Benefits of Production-Based Datasets

Curating from production logs provides several advantages:

Real user queries and interactions rather than hypothetical scenarios
Edge cases and failure modes discovered in production
Distribution of queries that matches actual usage patterns
Continuously evolving test coverage as your application grows

Curating from Human Annotations

For creating golden datasets with verified correct outputs:

Set up test runs and send results to human raters for annotation
Review completed ratings including comments and human-corrected outputs
Select high-quality annotated entries using row checkboxes
Map human-corrected outputs to ground truth columns in your golden dataset
Selectively include only the columns relevant to your evaluation needs
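
As a rough sketch of the last three steps, the snippet below builds golden rows from annotated entries in plain Python. It is not Maxim's API: the entry fields (rating, corrected_output, comment), the column names, the rating threshold, and the helper build_golden_rows are assumptions made for illustration. It also assumes one particular policy: a human-corrected output always becomes the ground truth, and an uncorrected output is accepted only when its rating meets the threshold.

```python
from typing import Any

# Hypothetical annotated test-run entries; field names are assumptions.
ANNOTATED: list[dict[str, Any]] = [
    {"input": "Refund policy?", "output": "30 days.", "rating": 5,
     "corrected_output": None, "comment": "Correct."},
    {"input": "Shipping time?", "output": "1 day.", "rating": 2,
     "corrected_output": "3 to 5 business days.", "comment": "Wrong duration."},
    {"input": "Store hours?", "output": "Unknown.", "rating": 1,
     "corrected_output": None, "comment": "No answer given."},
]


def build_golden_rows(entries, min_rating, columns):
    """Select annotated entries and map them to golden-dataset columns."""
    rows = []
    for entry in entries:
        corrected = entry.get("corrected_output")
        # Skip entries that were neither corrected by a rater nor rated highly.
        if corrected is None and entry["rating"] < min_rating:
            continue
        full_row = {
            "Input": entry["input"],
            # Prefer the human-corrected output as ground truth.
            "Expected Output": corrected if corrected is not None else entry["output"],
            "Rater Comment": entry["comment"],
        }
        # Include only the columns requested for this evaluation.
        rows.append({name: full_row[name] for name in columns})
    return rows


golden = build_golden_rows(
    ANNOTATED, min_rating=4, columns=["Input", "Expected Output"]
)
print(len(golden), "golden rows")
```

Passing a shorter columns list is the code analogue of selectively including only the columns relevant to an evaluation.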

Used together, these two approaches help keep your evaluation suite relevant and comprehensive: production logs supply scale and realistic coverage, while human annotation supplies verified correct outputs.
