Evaluations on Maxim entail three core components:
  • The system you’re evaluating: You can evaluate individual prompts or end-to-end agents. Maxim allows you to run detailed comparison experiments across different prompts, models, parameters, contexts, and tool combinations.
  • Datasets: You run your evals against curated datasets. Maxim lets you create multi-modal datasets and evolve them over time by leveraging production logs and human feedback. You can also generate datasets synthetically (see the dataset sketch after this list).
  • Evaluators: These are metrics tuned to your specific outcomes, used to measure agent quality. You can create your own custom metrics or use pre-built multi-modal evaluators from Maxim’s Evaluator Store. The platform also has deep support for human-in-the-loop workflows, helping you balance automated evals with nuanced human evaluation of AI quality.

You can execute large-scale evals with these components through an intuitive no-code interface (ideal for Product Managers) or automate them in CI/CD workflows using our Go, TypeScript, Python, or Java SDKs (see the CI sketch below). You can also run retroactive analyses to generate comparison reports that uncover trends over time and help you optimize your agents.
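As an illustration of what a curated dataset entry might look like, the sketch below builds a small evaluation dataset as a CSV file. The column names (input, expected_output, context) are assumptions chosen for the example, not a required Maxim schema; actual column mappings are configured when you create or upload a dataset on the platform.

```python
import csv

# Hypothetical dataset rows: each row pairs an input query with an
# expected output and any retrieved context the agent should use.
# Column names are illustrative, not a required Maxim schema.
rows = [
    {
        "input": "What is the refund window for annual plans?",
        "expected_output": "Annual plans can be refunded within 30 days of purchase.",
        "context": "Refund policy: annual subscriptions are refundable for 30 days.",
    },
    {
        "input": "How do I rotate my API key?",
        "expected_output": "Generate a new key under Settings > API Keys, then revoke the old one.",
        "context": "API keys are managed under Settings > API Keys.",
    },
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "context"])
    writer.writeheader()
    writer.writerows(rows)
```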
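A CI/CD automation typically triggers a test run against a dataset with a chosen set of evaluators and fails the pipeline if quality drops. The Python sketch below shows the overall shape of such a script; the client class, method names, and fields (MaximClient, create_test_run, aggregate_score, and the environment variables) are placeholders for illustration only, so refer to the SDK reference for the actual API.

```python
import os

# Placeholder SDK surface: the import path, class, and methods below are
# assumptions for illustration, not the documented Maxim SDK API.
from maxim import MaximClient  # assumed import


def run_ci_evals() -> None:
    client = MaximClient(api_key=os.environ["MAXIM_API_KEY"])  # assumed constructor

    # Trigger a test run against a curated dataset with a mix of
    # pre-built and custom evaluators (identifiers are illustrative).
    run = client.create_test_run(
        name="nightly-agent-regression",
        dataset_id=os.environ["MAXIM_DATASET_ID"],
        workflow_id=os.environ["MAXIM_WORKFLOW_ID"],
        evaluators=["faithfulness", "toxicity", "my-custom-helpfulness"],
    )

    result = run.wait_for_completion()  # assumed blocking helper

    # Fail the CI job if the aggregate score falls below a threshold.
    if result.aggregate_score < 0.8:
        raise SystemExit(f"Eval score {result.aggregate_score:.2f} below threshold")


if __name__ == "__main__":
    run_ci_evals()
```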
Learn more about prompt evaluation here.