Maxim AI - Product Updates, December 2024 ✨

Feature spotlight
👩‍🏫 Use human annotation queues for data curation
To optimize the quality and performance of AI applications, teams need to continuously inspect traces, annotate examples, and use these logs as datasets for evaluation, fine-tuning, or in-context learning. Today, this is a labor-intensive process in which each step is siloed and lacks a structured workflow.
The human annotation workflows on Maxim streamline this process by allowing teams to:
- Create human annotation queues, either by defining automated queueing logic (e.g., 👎🏼 user feedback or a low Faithfulness score) or by manually selecting examples using filters.
- Collect human reviews across multiple dimensions (e.g., fact check, bias). A review can include a score, reasoning/comments, and a rewritten output.
- Assign human annotation tasks to one or more reviewers as required.
- Curate datasets for evals, experimentation, or fine-tuning by combining auto-eval scores, human scores, and human-annotated outputs.
AI teams are using a combination of auto-evals and human reviews to increase the reliability of their AI applications. Try it out here.
📊 Introducing analytics dashboards in Maxim!
Teams are running multiple experiments on our platform, and we want to make it effortless to analyze progress over time. You can now generate reports to compare past experiment runs, visualize score differences and pass/fail results, and share them easily.
- Compare prompts, prompt chains, and AI workflows across runs.
- Create side-by-side comparisons, analyze trends, and optimize.
- Collaborate on custom dashboards and share them with one click, internally or externally.
Create your first comparison report with Maxim.
📋 Trigger test runs using the Node.js SDK
Maxim's SDK makes it seamless to test your AI workflows—right from your local environment. No more repetitive data uploads or back-and-forth interactions with the platform.
Our SDK support for Node.js makes quality assurance faster, more efficient, and more developer-friendly:
- Flexible data sources: Test with local CSV files or other data sources as datasets.
- Local testing: Run tests directly on your machine without data uploads.
- Seamless monitoring: Track test statuses in the Maxim dashboard, just like regular runs.
Get started here.
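To make the flow concrete, here is a minimal sketch of triggering a test run from a local CSV with the Node.js SDK. Treat it as pseudocode under assumptions: the package name, builder methods (createTestRun, withData, withEvaluators, yieldsOutput), evaluator names, and the myLocalWorkflow stub are illustrative, not the SDK's exact API; refer to the documentation linked above for the real signatures.

```typescript
// Hypothetical sketch only: package, class, and method names are assumptions
// based on the description above, not the SDK's exact API. Check the docs for real signatures.
import { Maxim } from "@maximai/maxim-js";

// Your local AI workflow, stubbed here for illustration.
async function myLocalWorkflow(input: string): Promise<string> {
  return `echo: ${input}`;
}

const maxim = new Maxim({ apiKey: process.env.MAXIM_API_KEY! });

await maxim
  .createTestRun("checkout-agent-regression", "<workspace-id>") // name and ID are placeholders
  .withData("./data/test-cases.csv")      // local CSV used as the dataset, no upload step
  .withEvaluators("Faithfulness", "Bias") // evaluators configured in your workspace
  .yieldsOutput(async (row) => {
    // Call your workflow locally for each row and return its output for evaluation.
    const output = await myLocalWorkflow(row.input);
    return { data: output };
  })
  .run();

// The run's status and evaluator scores then appear in the Maxim dashboard,
// just like runs triggered from the platform.
```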
Feature round-up
📤 Export log data as CSV
Users can now export logs along with their evaluation data (scores and feedback) in a single CSV file. Navigate to the "Logs" window, select your desired log repository, and click the "Export CSV" button to export your tracing data.
💰 Track evaluation cost for LLM-as-a-judge evals
To give users better visibility into their evaluation costs when using LLM-as-a-judge evaluators, we've introduced the "Evaluation cost" column in test run reports. This column displays the evaluation cost incurred per query and supports filtering and sorting.
📥 Import cURL directly as Workflow
Workflows are one of the most loved features on Maxim, enabling users to test their AI applications through API endpoints. Setting up a workflow is now easier than ever: simply paste your cURL command directly into the address bar to get started.
🧊 Region-specific deployments with AWS Bedrock
Users can now perform region-specific model deployments using the INFERENCE_PROFILE inference type on AWS Bedrock. To add these models:
- Go to the Settings page, click "Models", and head to the "Bedrock" tab.
- Click "Add new" and enter the Inference Profile ID along with its corresponding ARN.
0️⃣ Normalize evaluator scores
For consistent reporting across test runs and evaluations, users can now normalize their evaluator scores to a 0-1 range.
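As a quick illustration of the idea, standard min-max scaling maps a score from its original range onto 0-1; the 1-5 rubric range in this sketch is an assumed example, and the platform handles the actual conversion when normalization is enabled.

```typescript
// Min-max scaling: map a score from [min, max] onto [0, 1].
// The 1-5 range is only an example rubric scale, not a Maxim-specific constant.
function normalizeScore(score: number, min: number, max: number): number {
  return (score - min) / (max - min);
}

console.log(normalizeScore(4, 1, 5)); // 0.75
```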
Upcoming releases
- Granular evaluation for agentic workflows: Evaluate the performance and quality of any node of your AI agent by attaching relevant evaluators for each generation, retrieval, or tool call. Visualize, debug, and optimize your workflow via a collaborative dashboard.
- Agent simulation: Simulate multi-turn interactions with your AI agent and evaluate end-to-end sessions across scenarios.
Knowledge nuggets
Inside OpenAI’s o1
The o1 family of models uses reinforcement learning to enhance complex reasoning, incorporating Chain-of-Thought (CoT) to refine decision-making. By shifting from fast, intuitive thinking to slower, more deliberate reasoning, o1 models explore various strategies, identify mistakes, and align with safety guidelines to avoid unsafe content. Trained on a combination of public and proprietary datasets, including reasoning data and scientific literature, these models undergo rigorous data filtering to remove personally identifiable information (PII), ensuring compliance with safety and privacy standards.

Innovative training of LLMs in continuous latent spaces
Coconut (Chain of Continuous Thought) improves on traditional language-based reasoning methods like Chain-of-Thought (CoT) by shifting reasoning from "language space" into a continuous latent space. Instead of decoding each reasoning step into tokens, it feeds the model's last hidden state back in as the next input embedding, allowing for more flexible and efficient problem-solving. Because a continuous thought can encode multiple candidate reasoning paths simultaneously, Coconut supports better decision-making in tasks that require planning and backtracking. Experiments show that it outperforms CoT on math and logic tasks while generating fewer tokens, improving computational efficiency.
