Import or Create Datasets
Learn how to import or create datasets in Maxim
Datasets are collections of data used for training, testing, and evaluating AI models within workflows and evaluations. Test your prompts, HTTP agents, or no-code agents across the test cases in a dataset and view results at scale. Begin with a template and customize the column structure, then evolve your datasets over time from production logs or human annotation.
Create Datasets Using Templates
Create Datasets quickly with predefined structures using our templates:
Prompt or Workflow Testing
Choose this template for single-turn interactions based on individual inputs to test prompts or workflows.
Example: Input column with prompts like “Summarize this article about climate change” paired with an Expected Output column containing ideal responses.
Agent Simulation
Select this template for multi-turn simulations to test agent behaviors across conversation sequences.
Example: Scenario column with “Customer inquiring about return policy” and Expected Steps column outlining the agent’s expected actions.
Dataset Testing
Use this template when evaluating against existing output data to compare expected and actual results.
Example: Input column with “What’s the weather in New York?” and Expected Output column with “65°F and sunny” for direct evaluation.
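To make the column structures concrete, here are two illustrative rows that use the Input and Expected Output columns described above, shown as JSON purely for readability; the actual columns and import format depend on the template you choose:

```json
[
  {
    "Input": "Summarize this article about climate change",
    "Expected Output": "A concise, neutral summary of the article's key points"
  },
  {
    "Input": "What's the weather in New York?",
    "Expected Output": "65°F and sunny"
  }
]
```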
Add Images to Your Dataset
You can enhance your datasets by including images alongside other data types. This is particularly useful for:
- Visual content evaluation
- Image-based prompts and responses
- Multi-modal testing scenarios
You can add images to your Dataset by creating a column of type Images. We support both URLs and local file paths.
When working with images in datasets:
- Supported formats include common image types (PNG, JPG, JPEG, GIF)
- For URLs, ensure they are publicly accessible
- For local files, maintain consistent file paths across your team
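As an illustration only, a row with an Images column might look something like this; the URL and file paths are placeholders, and the exact cell format in your workspace may differ:

```json
{
  "Input": "Does this product photo match the listing description?",
  "Images": [
    "https://example.com/assets/product-front.png",
    "./images/product-side.jpg"
  ]
}
```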
Column Types
Scenario
The Scenario column type allows you to define specific situations or contexts for your test cases. Use this column to describe the background, user intent, or environment in which an interaction takes place. Scenarios help guide agents or models to respond appropriately based on the described situation.
Examples:
- “A customer wants to buy an iPhone.”
- “A user is trying to cancel their subscription.”
- “A student asks for help with a math problem.”
Scenarios are especially useful for simulating real-world conversations, testing agent behaviors, and ensuring your models handle a variety of user intents and contexts. Use this column when the dataset will be used for agent simulation runs.
Expected Steps
The Expected Steps column type allows you to specify the sequence of actions or decisions that an agent should take in response to a given scenario. This helps users clearly outline the ideal process or workflow, making it easier for evaluators to verify whether the agent is behaving as intended.
Use this column to break down the expected agent behavior into individual, logical steps. This is especially useful for multi-turn interactions or complex tasks where the agent’s reasoning and actions need to be evaluated step by step.
Example:
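For the return-policy scenario described earlier, the expected steps might be broken down like this (an illustrative sketch, not a required format):
- Greet the customer and ask for their order number
- Look up the order and check whether it is eligible for return
- Explain the return policy and the refund timeline
- Offer to initiate the return and confirm the next steps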
Including expected steps in your dataset enables more granular evaluation and helps ensure that agents follow the correct procedures during simulations or tests.
Expected Tool Calls
The Expected Tool Calls column type allows you to specify which tools (such as APIs, functions, or plugins) you expect an agent to use in response to a scenario. This is especially useful in prompt runs, where you want to evaluate whether the agent is choosing and invoking the correct tools as part of its reasoning process.
Use this column to list the names of the tools or actions that should be called, optionally including parameters or expected arguments. This helps ensure that the agent’s tool usage aligns with your expectations for the task.
Examples:
- “search_web”
- “get_weather(location='San Francisco')”
- “send_email(recipient, subject, body)”
Including expected tool calls in your dataset enables more precise evaluation of agent behavior, particularly in scenarios where tool usage is critical to task completion.
inAnyOrder
This combinator indicates that all listed tool calls are mandatory, but they may be executed in any order.
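A minimal sketch of how this could look in the Expected Tool Calls column, reusing tool names from the examples above; the exact schema (key names, argument format) is an assumption here, not a guaranteed contract:

```json
{
  "inAnyOrder": [
    { "name": "get_weather", "arguments": { "location": "San Francisco" } },
    { "name": "search_web", "arguments": { "query": "weather in San Francisco" } }
  ]
}
```

In this sketch both calls must happen, but the agent may make them in either order.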
anyOne
The anyOne combinator is used when any one of several possible tool calls is acceptable to fulfill the requirement. This is useful in scenarios where there are multiple valid ways for an agent to achieve the same outcome, and you want to allow for flexibility in the agent’s approach.
For example, in the following JSON, either get_pull_request_reviews or get_pull_request_comments (with the specified arguments) will be considered a valid response. The agent only needs to make one of these tool calls to satisfy the expectation.
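A sketch of what that JSON could look like, with hypothetical arguments; treat the key names and argument values as assumptions rather than the exact schema:

```json
{
  "anyOne": [
    { "name": "get_pull_request_reviews", "arguments": { "owner": "acme", "repo": "web-app", "pullNumber": 42 } },
    { "name": "get_pull_request_comments", "arguments": { "owner": "acme", "repo": "web-app", "pullNumber": 42 } }
  ]
}
```

Here the expectation is satisfied if the agent makes either of the two calls; it does not need to make both.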
Conversation History
Conversation history allows you to include a chat history while running Prompt tests. The sequence of messages sent to the LLM is as follows:
- messages in the prompt version
- history
- input column in the dataset.
Format
- Conversation history is always a JSON array
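For example, a short history might look like the following. The role/content message shape shown here is an assumption based on common chat formats; check the exact field names expected in your workspace:

```json
[
  { "role": "user", "content": "Hi, I'd like to return the laptop I ordered last week." },
  { "role": "assistant", "content": "Of course. Could you share your order number so I can look it up?" }
]
```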