Test Runs via SDK

Learn how to run test runs programmatically with Maxim's JS/TS SDK (@maximai/maxim-js), using custom datasets, flexible output functions, and evaluators for your AI applications.

The SDK uses a builder pattern to configure and run tests. Here's a basic example:

Test run template
const result = await maxim
  .createTestRun("My First SDK Test", "your-workspace-id")
  .withDataStructure(/* your data structure here */)
  .withData(/* your data here */)
  .yieldsOutput(/* your output function here */)
  .withEvaluators(/* your evaluators here */)
  .run();

You can find your workspace ID in the platform URL: it is the path segment that follows /workspace (for example, a URL containing /workspace/abc123 points at workspace abc123).

Understanding Data Structure

The data structure maps your data columns to types that Maxim understands. It keeps the test run type safe and lets the SDK validate the columns you provide.

Basic Structure

The data structure is an object where keys are your column names and values are the column types.

const dataStructure = {
    myQuestionColumn: "INPUT",
    expectedAnswerColumn: "EXPECTED_OUTPUT",
    contextColumn: "CONTEXT_TO_EVALUATE",
    additionalDataColumn: "VARIABLE"
}

Available Types

  • INPUT - Main input text (only one allowed)
  • EXPECTED_OUTPUT - Expected response (only one allowed)
  • CONTEXT_TO_EVALUATE - Context for evaluation (only one allowed)
  • VARIABLE - Additional data columns (multiple allowed)
  • NULLABLE_VARIABLE - Optional data columns (multiple allowed)

Example

Using the data structure
const result = maxim
    .createTestRun("Question Answering Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE",
        metadata: "NULLABLE_VARIABLE"
    })
    // ... rest of the configuration
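
Since VARIABLE and NULLABLE_VARIABLE columns may appear more than once, a data structure can also carry several auxiliary columns alongside the single INPUT, EXPECTED_OUTPUT, and CONTEXT_TO_EVALUATE columns. A minimal sketch (the column names here are illustrative):

Data structure with multiple variable columns
const dataStructure = {
    question: "INPUT",
    answer: "EXPECTED_OUTPUT",
    context: "CONTEXT_TO_EVALUATE",
    userPersona: "VARIABLE",           // additional data, expected in every row
    locale: "VARIABLE",
    reviewerNotes: "NULLABLE_VARIABLE" // optional, may be missing for some rows
};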

Working with Data Sources

Maxim's SDK supports multiple ways to provide test data:

1. CSV Files

The CSVFile class provides a robust way to work with CSV files:

Using a CSV file
import { CSVFile } from '@maximai/maxim-js';
 
const myCsvFile = new CSVFile('./test.csv', {
    question: 0, // column index in CSV
    answer: 1,
    context: 2
});
 
const result = maxim
    .createTestRun("CSV Test Run", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myCsvFile)
    // ... rest of the configuration

The CSVFile class automatically validates your CSV headers against the data structure and provides type-safe access to your data.
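
For reference, here is a sketch of what the test.csv behind the example above might contain, and how each parsed row then surfaces inside the output function (the file contents and the yourModel call are illustrative assumptions):

Illustrative CSV contents and row access
// test.csv (illustrative contents; the header row is what the SDK
// validates against your data structure):
//
//   question,answer,context
//   "What is the capital of France?","Paris","France is a country in Western Europe..."

// Each parsed row reaches your output function keyed by those column names:
// .yieldsOutput(async (data) => {
//     const response = await yourModel.call(data.question, data.context); // hypothetical model call
//     return { data: response.text };
// })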

2. Manual Data Array

For smaller datasets or programmatically generated data:

Using a manual data array
const manualData = [
    {
        question: "What is the capital of France?",
        answer: "Paris",
        context: "France is a country in Western Europe..."
    },
    {
        question: "Who wrote Romeo and Juliet?",
        answer: "William Shakespeare",
        context: "William Shakespeare was an English playwright..."
    }
];
 
const result = maxim
    .createTestRun("Manual Data Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(manualData)
    // ... rest of the configuration

3. Platform Dataset

Use existing datasets from your Maxim workspace:

Using a platform dataset
const result = maxim
    .createTestRun("Platform Dataset Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData("your-dataset-id")
    // ... rest of the configuration

Custom Output Function

The output function is where you define how to generate responses for your test cases:

Implementing a custom output function
const result = maxim
    .createTestRun("Custom Output Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myData)
    .yieldsOutput(async (data) => {
        // Call your model or API
        const response = await yourModel.call(
            data.question,
            data.context
        );
 
        return {
            // Required: The actual output
            data: response.text,
 
            // Optional: Context used for evaluation
            // Returning a value here will utilize this context for
            // evaluation instead of the CONTEXT_TO_EVALUATE column (if provided)
            retrievedContextToEvaluate: response.relevantContext,
 
            // Optional: Performance metrics
            meta: {
                usage: {
                    promptTokens: response.usage.prompt_tokens,
                    completionTokens: response.usage.completion_tokens,
                    totalTokens: response.usage.total_tokens,
                    latency: response.latency
                },
                cost: {
                    input: response.cost.input,
                    output: response.cost.output,
                    total: response.cost.input + response.cost.output
                }
            }
        };
    })

If your output function throws an error, that entry is marked as failed and its index is reported in the failedEntryIndices array once the run completes.
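
For instance, a sketch of how failed entries could be surfaced after a run (the data configuration and the yourModel call are placeholders from the earlier examples):

Handling failed entries
const result = await maxim
    .createTestRun("Failure Handling Test", workspaceId)
    // ... withDataStructure and withData as in the examples above
    .yieldsOutput(async (data) => {
        // Letting an error propagate here marks this entry as failed
        // without aborting the rest of the run.
        const response = await yourModel.call(data.question); // hypothetical model call
        return { data: response.text };
    })
    .withEvaluators("Faithfulness")
    .run();

// Entries whose output function threw are reported by index.
if (result.failedEntryIndices.length > 0) {
    console.warn("Failed entry indices:", result.failedEntryIndices);
}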

Adding Evaluators

Choose which evaluators to use for your test run:

Adding evaluators
const result = maxim
    .createTestRun("Evaluated Test", workspaceId)
    // ... previous configuration
    .withEvaluators(
        "Faithfulness", // names of evaluators installed in your workspace
        "Semantic Similarity",
        "Answer Relevance"
    )

Human Evaluation

Evaluators that require human input also need a human evaluation configuration, set up as follows:

Setting up human evaluation
const result = maxim
    .createTestRun("Human Evaluated Test", workspaceId)
    // ... previous configuration
    .withEvaluators("Human Evaluator")
    .withHumanEvaluationConfig({
        emails: ["[email protected]"],
        instructions: "Please evaluate the response according to the evaluation criteria"
    })

Advanced Configuration

Concurrency Control

Manage how many entries are processed in parallel:

Configuring concurrency
const result = maxim
    .createTestRun("Concurrent Test", workspaceId)
    // ... previous configuration
    .withConcurrency(5) // Process 5 entries at a time

Timeout Configuration

Set a custom timeout for long-running tests:

Configuring timeout
const result = await maxim
    .createTestRun("Long Test", workspaceId)
    // ... previous configuration
    .run(120) // Wait up to 120 minutes

Complete Example

Here's a complete example combining all the features:

Test run example
import { CSVFile, Maxim } from '@maximai/maxim-js';
import 'dotenv/config';
 
const maxim = new Maxim({
    apiKey: process.env.MAXIM_API_KEY,
});
 
// Initialize your data source
const testData = new CSVFile('./qa_dataset.csv', {
    question: 0,
    expected_answer: 1,
    context: 2,
    metadata: 3
});
 
try {
    const result = await maxim
        .createTestRun(`QA Evaluation ${new Date().toISOString()}`, 'your-workspace-id')
        .withDataStructure({
            question: "INPUT",
            expected_answer: "EXPECTED_OUTPUT",
            context: "CONTEXT_TO_EVALUATE",
            metadata: "NULLABLE_VARIABLE"
        })
        .withData(testData)
        .yieldsOutput(async (data) => {
            const startTime = Date.now();
 
            // Your model call here
            const response = await yourModel.generateAnswer(
                data.question,
                data.context
            );
 
            const latency = Date.now() - startTime;
 
            return {
                data: response.answer,
                // Returning a value here will utilize this context for
                // evaluation instead of the CONTEXT_TO_EVALUATE column
                // (in this case, the `context` column)
                retrievedContextToEvaluate: response.retrievedContext,
                meta: {
                    usage: {
                        promptTokens: response.tokens.prompt,
                        completionTokens: response.tokens.completion,
                        totalTokens: response.tokens.total,
                        latency
                    },
                    cost: {
                        input: response.cost.prompt,
                        output: response.cost.completion,
                        total: response.cost.total
                    }
                }
            };
        })
        .withEvaluators(
            "Faithfulness",
            "Answer Relevance",
            "Human Evaluator"
        )
        .withHumanEvaluationConfig({
            emails: ["[email protected]"],
            instructions: `Please evaluate the responses for accuracy and completeness. Consider both factual correctness and answer format.`
        })
        .withConcurrency(10)
        .run(30); // 30 minutes timeout
 
    console.log("Test Run Link:", result.testRunResult.link);
    console.log("Failed Entries:", result.failedEntryIndices);
    console.log("Evaluation Results:", result.testRunResult.result);
    /*
    the result.testRunResult.result object looks like this (values are mock data):
    {
        cost: {
            input: 1.905419538506091,
            completion: 2.010163610111029,
            total: 3.915583148617119
        },
        latency: {
            min: 6,
            max: 484.5761906393187,
            p50: 438,
            p90: 484,
            p95: 484,
            p99: 484,
            mean: 346.2,
            standardDeviation: 179.4284,
            total: 5
        },
        name: 'sdk test run 1734931207308',
        usage: { completion: 206, input: 150, total: 356 },
        individualEvaluatorMeanScore: {
            Faithfulness: { score: 0, outOf: 1 },
            'Answer Relevance': { score: 0.2, outOf: 1 },
        }
    }
    */
} catch (error) {
    console.error("Test Run Failed:", error);
} finally {
    await maxim.cleanup();
}
