Evaluations on Maxim entail three core components:
  • The system you’re evaluating: You can evaluate individual prompts or end-to-end agents. Maxim allows you to run detailed comparison experiments across different prompts, models, parameters, contexts, and tool combinations.
  • Datasets: You run your evals against curated datasets. Maxim lets you create multi-modal datasets and evolve them over time by leveraging production logs and human feedback. You can also generate datasets synthetically (see the dataset sketch after this list).
  • Evaluators: These are metrics tuned to your specific outcomes, used to measure agent quality. You can create your own custom metrics or use pre-built multi-modal evaluators from Maxim’s Evaluator Store. The platform also has deep support for human-in-the-loop workflows, helping you balance automated evals with nuanced human evaluation of AI quality.

You can execute large-scale evals with these components through an intuitive no-code interface (ideal for Product Managers) or automate them in CI/CD workflows using our Go, TypeScript, Python, or Java SDKs (see the CI sketch below). You can also run retroactive analyses to generate comparison reports that uncover trends over time and help you optimize your agents.
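As an illustration of what a curated dataset entry might look like, the sketch below builds a small evaluation dataset as a CSV file. The column names (input, expected_output, context) are assumptions chosen for the example, not a required Maxim schema; actual column mappings are configured when you create or upload a dataset on the platform.

```python
import csv

# Hypothetical dataset rows: each row pairs an input query with an
# expected output and any retrieved context the agent should use.
# Column names are illustrative, not a required Maxim schema.
rows = [
    {
        "input": "What is the refund window for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days of purchase.",
        "context": "Refund policy: annual subscriptions are refundable for 30 days.",
    },
    {
        "input": "How do I rotate my API key?",
        "expected_output": "Generate a new key under Settings > API Keys, then revoke the old one.",
        "context": "API keys are managed under Settings > API Keys.",
    },
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "context"])
    writer.writeheader()
    writer.writerows(rows)
```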
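A CI/CD automation typically triggers a test run against a dataset with a chosen set of evaluators and fails the pipeline if quality drops. The Python sketch below shows the overall shape of such a script; the client class, method names, and fields (MaximClient, create_test_run, aggregate_score, and the environment variables) are placeholders for illustration only, so refer to the SDK reference for the actual API.

```python
import os

# Placeholder SDK surface: the import path, class, and methods below are
# assumptions for illustration, not the documented Maxim SDK API.
from maxim import MaximClient  # assumed import


def run_ci_evals() -> None:
    client = MaximClient(api_key=os.environ["MAXIM_API_KEY"])  # assumed constructor

    # Trigger a test run against a curated dataset with a mix of
    # pre-built and custom evaluators (identifiers are illustrative).
    run = client.create_test_run(
        name="nightly-agent-regression",
        dataset_id=os.environ["MAXIM_DATASET_ID"],
        workflow_id=os.environ["MAXIM_WORKFLOW_ID"],
        evaluators=["faithfulness", "toxicity", "my-custom-helpfulness"],
    )

    result = run.wait_for_completion()  # assumed blocking helper

    # Fail the CI job if the aggregate score falls below a threshold.
    if result.aggregate_score < 0.8:
        raise SystemExit(f"Eval score {result.aggregate_score:.2f} below threshold")


if __name__ == "__main__":
    run_ci_evals()
```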
Learn more about prompt evaluation here.