Import or Create Datasets
Learn how to import or create datasets in Maxim
Datasets are collections of data used for training, testing, and evaluating AI models within workflows and evaluations. Test your prompts, HTTP agents, or no-code agents across the test cases in a dataset and view results at scale. Begin with a template and customize the column structure, then evolve your datasets over time from production logs or human annotation.
Create Datasets Using Templates
Create Datasets quickly with predefined structures using our templates:
Prompt or Workflow Testing
Choose this template for single-turn interactions based on individual inputs to test prompts or workflows.
Example: Input column with prompts like “Summarize this article about climate change” paired with an Expected Output column containing ideal responses.
Agent Simulation
Select this template for multi-turn simulations to test agent behaviors across conversation sequences.
Example: Scenario column with “Customer inquiring about return policy” and Expected Steps column outlining the agent’s expected actions.
Dataset Testing
Use this template when evaluating against existing output data to compare expected and actual results.
Example: Input column with “What’s the weather in New York?” and Expected Output column with “65°F and sunny” for direct evaluation.
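To make the column structures concrete, here are two illustrative rows that use the Input and Expected Output columns described above, shown as JSON purely for readability; the actual columns and import format depend on the template you choose:

```json
[
  {
    "Input": "Summarize this article about climate change",
    "Expected Output": "A concise, neutral summary of the article's key points"
  },
  {
    "Input": "What's the weather in New York?",
    "Expected Output": "65°F and sunny"
  }
]
```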
Add Images to Your Dataset
You can enhance your datasets by including images alongside other data types. This is particularly useful for:
- Visual content evaluation
- Image-based prompts and responses
- Multi-modal testing scenarios
You can add images to your Dataset by creating a column of type Images. We support both URLs and local file paths.
When working with images in datasets:
- Supported formats include common image types (PNG, JPG, JPEG, GIF)
- For URLs, ensure they are publicly accessible
- For local files, maintain consistent file paths across your team
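As an illustration only, a row with an Images column might look something like this; the URL and file paths are placeholders, and the exact cell format in your workspace may differ:

```json
{
  "Input": "Does this product photo match the listing description?",
  "Images": [
    "https://example.com/assets/product-front.png",
    "./images/product-side.jpg"
  ]
}
```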
Column Types
Scenario
The Scenario column type allows you to define specific situations or contexts for your test cases. Use this column to describe the background, user intent, or environment in which an interaction takes place. Scenarios help guide agents or models to respond appropriately based on the described situation.
Examples:
- “A customer wants to buy an iPhone.”
- “A user is trying to cancel their subscription.”
- “A student asks for help with a math problem.”
Scenarios are especially useful for simulating real-world conversations, testing agent behaviors, and ensuring your models handle a variety of user intents and contexts. Use this column when the dataset will be used for agent simulation runs.
Expected Steps
The Expected Steps column type allows you to specify the sequence of actions or decisions that an agent should take in response to a given scenario. This helps users clearly outline the ideal process or workflow, making it easier for evaluators to verify whether the agent is behaving as intended.
Use this column to break down the expected agent behavior into individual, logical steps. This is especially useful for multi-turn interactions or complex tasks where the agent’s reasoning and actions need to be evaluated step by step.
Example:
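For the return-policy scenario described earlier, the expected steps might be broken down like this (an illustrative sketch, not a required format):
- Greet the customer and ask for their order number
- Look up the order and check whether it is eligible for return
- Explain the return policy and the refund timeline
- Offer to initiate the return and confirm the next steps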
Including expected steps in your dataset enables more granular evaluation and helps ensure that agents follow the correct procedures during simulations or tests.
Expected Tool Calls
The Expected Tool Calls column type allows you to specify which tools (such as APIs, functions, or plugins) you expect an agent to use in response to a scenario. This is especially useful in prompt runs, where you want to evaluate whether the agent is choosing and invoking the correct tools as part of its reasoning process.
Use this column to list the names of the tools or actions that should be called, optionally including parameters or expected arguments. This helps ensure that the agent’s tool usage aligns with your expectations for the task.
Examples:
- “search_web”
- “get_weather(location='San Francisco')”
- “send_email(recipient, subject, body)”
Including expected tool calls in your dataset enables more precise evaluation of agent behavior, particularly in scenarios where tool usage is critical to task completion.
inAnyOrder
This combinator indicates that all listed tool calls are mandatory, but they may be executed in any order.
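A minimal sketch of how this could look in the Expected Tool Calls column, reusing tool names from the examples above; the exact schema (key names, argument format) is an assumption here, not a guaranteed contract:

```json
{
  "inAnyOrder": [
    { "name": "get_weather", "arguments": { "location": "San Francisco" } },
    { "name": "search_web", "arguments": { "query": "weather in San Francisco" } }
  ]
}
```

In this sketch both calls must happen, but the agent may make them in either order.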
anyOne
The anyOne combinator is used when any one of several possible tool calls is acceptable to fulfill the requirement. This is useful in scenarios where there are multiple valid ways for an agent to achieve the same outcome, and you want to allow for flexibility in the agent’s approach.
For example, in the following JSON, either get_pull_request_reviews or get_pull_request_comments (with the specified arguments) will be considered a valid response. The agent only needs to make one of these tool calls to satisfy the expectation.
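A sketch of what that JSON could look like, with hypothetical arguments; treat the key names and argument values as assumptions rather than the exact schema:

```json
{
  "anyOne": [
    { "name": "get_pull_request_reviews", "arguments": { "owner": "acme", "repo": "web-app", "pullNumber": 42 } },
    { "name": "get_pull_request_comments", "arguments": { "owner": "acme", "repo": "web-app", "pullNumber": 42 } }
  ]
}
```

Here the expectation is satisfied if the agent makes either of the two calls; it does not need to make both.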
Conversation History
Conversation history allows you to include a chat history while running Prompt tests. The sequence of messages sent to the LLM is as follows:
- messages in the prompt version
- history
- input column in the dataset.
Format
- Conversation history is always a JSON array
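For example, a short history might look like the following. The role/content message shape shown here is an assumption based on common chat formats; check the exact field names expected in your workspace:

```json
[
  { "role": "user", "content": "Hi, I'd like to return the laptop I ordered last week." },
  { "role": "assistant", "content": "Of course. Could you share your order number so I can look it up?" }
]
```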