
Trigger Test Runs using SDK

Learn how to programmatically trigger test runs using Maxim's SDK with custom datasets, flexible output functions, and evaluations for your AI applications.

While Maxim's web interface provides a powerful way to run tests, the SDK offers even more flexibility and control. With the SDK, you can:

  • Use custom datasets directly from your code
  • Control how outputs are generated
  • Integrate testing into your CI/CD pipeline
  • Get real-time feedback on test progress
  • Handle errors programmatically

Example of triggering test runs using the SDK

The SDK uses a builder pattern to configure and run tests. Follow this example to trigger test runs:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "" });
 
const result = await maxim
.createTestRun("My First SDK Test", "your-workspace-id")
.withDataStructure(/* your data structure here */)
.withData(/* your data here */)
.yieldsOutput(/* your output function here */)
.withWorkflowId(/* or you can pass workflow ID from Maxim platform */)
.withPromptVersionId(/* or you can pass prompt version ID from Maxim platform */)
.withEvaluators(/* your evaluators here */)
.run();

Copy your workspace ID from the workspace switcher in the left topbar.

Screenshot of the copy workspace ID option

Understanding the data structure

The data structure keeps your test run type safe and lets Maxim validate your data columns. It maps each column in your data to a specific type that Maxim understands.

Basic structure

Define your data structure using an object that maps column names to specific types.

const dataStructure = {
    myQuestionColumn: "INPUT",
    expectedAnswerColumn: "EXPECTED_OUTPUT",
    contextColumn: "CONTEXT_TO_EVALUATE",
    additionalDataColumn: "VARIABLE"
}

Available types

  • INPUT - Main input text (only one allowed)
  • EXPECTED_OUTPUT - Expected response (only one allowed)
  • CONTEXT_TO_EVALUATE - Context for evaluation (only one allowed)
  • VARIABLE - Additional data columns (multiple allowed)
  • NULLABLE_VARIABLE - Optional data columns (multiple allowed)

Example

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Question Answering Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE",
        metadata: "NULLABLE_VARIABLE"
    })
    // ... rest of the configuration

Working with data sources

Maxim's SDK supports multiple ways to provide test data:

1. CSV files

import { CSVFile, Maxim } from '@maximai/maxim-js';
 
const myCSVFile = new CSVFile('./test.csv', {
    question: 0, // column index in CSV
    answer: 1,
    context: 2
});
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("CSV Test Run", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myCSVFile)
    // ... rest of the configuration

The CSVFile class automatically validates your CSV headers against the data structure and provides type-safe access to your data.
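For reference, a test.csv matching the column mapping above could look like the following (illustrative rows; the header names line up with the data structure keys and the column indices passed to CSVFile):

question,answer,context
"What is the capital of France?","Paris","France is a country in Western Europe..."
"Who wrote Romeo and Juliet?","William Shakespeare","William Shakespeare was an English playwright..."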

2. Manual data array

For smaller datasets or programmatically generated data:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const manualData = [
    {
        question: "What is the capital of France?",
        answer: "Paris",
        context: "France is a country in Western Europe..."
    },
    {
        question: "Who wrote Romeo and Juliet?",
        answer: "William Shakespeare",
        context: "William Shakespeare was an English playwright..."
    }
];
 
const result = maxim
    .createTestRun("Manual Data Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(manualData)
    // ... rest of the configuration

3. Platform datasets

Use existing datasets from your Maxim workspace:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Platform Dataset Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData("your-dataset-id")
    // ... rest of the configuration

Trigger a test on a workflow stored on the Maxim platform

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Custom Output Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myData)
    .withWorkflowId(workflowIdFromDashboard, contextToEvaluate) // context to evaluate is optional; it can either be a variable used in the workflow or a column name present in the dataset

Find the workflow ID in the Workflows tab: open the workflow's menu and click Copy ID.

Screenshot of the copy workflow ID option

Trigger a test on a prompt version stored on the Maxim platform

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Custom Output Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myData)
    .withPromptVersionId(promptVersionIdFromPlatform, contextToEvaluate) // context to evaluate is optional; it can either be a variable used in the prompt or a column name present in the dataset

To get the prompt version ID, go to the Prompts tab, select the version you want to test, and click Copy version ID from the menu.

Screenshot of the copy prompt version ID option

Custom output function

The output function is where you define how to generate responses for your test cases:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Custom Output Test", workspaceId)
    .withDataStructure({
        question: "INPUT",
        answer: "EXPECTED_OUTPUT",
        context: "CONTEXT_TO_EVALUATE"
    })
    .withData(myData)
    .yieldsOutput(async (data) => {
        // Call your model or API
        const response = await yourModel.call(
            data.question,
            data.context
        );
 
        return {
            // Required: The actual output
            data: response.text,
 
            // Optional: Context used for evaluation
            // Returning a value here will utilize this context for
            // evaluation instead of the CONTEXT_TO_EVALUATE column (if provided)
            retrievedContextToEvaluate: response.relevantContext,
 
            // Optional: Performance metrics
            meta: {
                usage: {
                    promptTokens: response.usage.prompt_tokens,
                    completionTokens: response.usage.completion_tokens,
                    totalTokens: response.usage.total_tokens,
                    latency: response.latency
                },
                cost: {
                    input: response.cost.input,
                    output: response.cost.output,
                    total: response.cost.input + response.cost.output
                }
            }
        };
    })

If your output function throws an error, the entry is marked as failed and its index is included in the failedEntryIndices array after the run completes.
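For example, you can check for failed entries once the run finishes (a minimal sketch; the builder configuration is elided and assumed to be set up as in the earlier examples):

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = await maxim
    .createTestRun("Error Handling Test", workspaceId)
    // ... data structure, data, output function, and evaluators as shown above
    .run();
 
// Entries whose output function threw are reported here after the run completes
if (result.failedEntryIndices.length > 0) {
    console.warn(
        `${result.failedEntryIndices.length} entries failed:`,
        result.failedEntryIndices
    );
}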

Adding evaluators

Choose which evaluators to use for your test run:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Evaluated Test", workspaceId)
    // ... previous configuration
    .withEvaluators(
        "Faithfulness", // names of evaluators installed in your workspace
        "Semantic Similarity",
        "Answer Relevance"
    )

Human evaluation

Evaluators that require human input need a human evaluation configuration, which you can set up as follows:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = maxim
    .createTestRun("Human Evaluated Test", workspaceId)
    // ... previous configuration
    .withEvaluators("Human Evaluator")
    .withHumanEvaluationConfig({
        emails: ["[email protected]"],
        instructions: "Please evaluate the response according to the evaluation criteria"
    })

Custom evaluators

You can create custom evaluators to implement specific evaluation logic for your test runs:

import {
    Maxim,
    createDataStructure,
    createCustomEvaluator,
    createCustomCombinedEvaluatorsFor,
} from "@maximai/maxim-js";
 
const maxim = new Maxim({
    apiKey: process.env.MAXIM_API_KEY
});
 
const dataStructure = createDataStructure({
    Input: 'INPUT',
    'Expected Output': 'EXPECTED_OUTPUT',
    stuff: 'CONTEXT_TO_EVALUATE',
});
 
// example of creating a custom evaluator
const myCustomEvaluator = createCustomEvaluator<typeof dataStructure>(
    'apostrophe-checker',
    (result) => {
        if (result.output.includes("'")) {
            return {
                score: true,
                reasoning: 'The output contains an apostrophe',
            };
        } else {
            return {
                score: false,
                reasoning: 'The output does not contain an apostrophe',
            };
        }
    },
    {
        onEachEntry: {
            scoreShouldBe: '=',
            value: true,
        },
        forTestrunOverall: {
            overallShouldBe: '>=',
            value: 80,
            for: 'percentageOfPassedResults',
        },
    },
);
 
// example of creating a combined custom evaluator
const myCombinedCustomEvaluator = createCustomCombinedEvaluatorsFor(
    'apostrophe-checker-2',
    'containsSpecialCharacters',
).build<typeof dataStructure>(
    (result) => {
        return {
            'apostrophe-checker-2': {
                score: result.output.includes("'") ? true : false,
                reasoning: result.output.includes("'")
                    ? 'The output contains an apostrophe'
                    : 'The output does not contain an apostrophe',
            },
            containsSpecialCharacters: {
                score: result.output
                    .split('')
                    .filter((char) => /[!@#$%^&*(),.?"':{}|<>]/.test(char))
                    .length,
            },
        };
    },
    {
        'apostrophe-checker-2': {
            onEachEntry: {
                scoreShouldBe: '=',
                value: true,
            },
            forTestrunOverall: {
                overallShouldBe: '>=',
                value: 80,
                for: 'percentageOfPassedResults',
            },
        },
        containsSpecialCharacters: {
            onEachEntry: {
                scoreShouldBe: '>',
                value: 3,
            },
            forTestrunOverall: {
                overallShouldBe: '>=',
                value: 80,
                for: 'percentageOfPassedResults',
            },
        },
    },
);

Using custom evaluators

Once created, custom evaluators can be used alongside built-in evaluators:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = await maxim
    .createTestRun(`sdk test run ${Date.now()}`, workspaceId)
    .withEvaluators(
        // platform evaluators
        'Faithfulness',
        'Semantic Similarity',
        // custom evaluators
        myCustomEvaluator,
        myCombinedCustomEvaluator,
    )
    .run();

Advanced configuration

Concurrency control

Manage how many entries are processed in parallel:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = await maxim
    .createTestRun("Long Test", workspaceId)
    // ... previous configuration
    .withConcurrency(5); // Process 5 entries at a time

Timeout configuration

Set a custom timeout for long-running tests:

import { Maxim } from "@maximai/maxim-js";
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
const result = await maxim
    .createTestRun("Long Test", workspaceId)
    // ... previous configuration
    .run(120) // Wait up to 120 minutes

Complete example

Here's a complete example combining all the features:

import { CSVFile, Maxim } from '@maximai/maxim-js';
 
const maxim = new Maxim({ apiKey: "YOUR_API_KEY" });
 
// Initialize your data source
const testData = new CSVFile('./qa_dataset.csv', {
    question: 0,
    expected_answer: 1,
    context: 2,
    metadata: 3
});
 
try {
    const result = await maxim
        .createTestRun(`QA Evaluation ${new Date().toISOString()}`, 'your-workspace-id')
        .withDataStructure({
            question: "INPUT",
            expected_answer: "EXPECTED_OUTPUT",
            context: "CONTEXT_TO_EVALUATE",
            metadata: "NULLABLE_VARIABLE"
        })
        .withData(testData)
        .yieldsOutput(async (data) => {
            const startTime = Date.now();
 
            // Your model call here
            const response = await yourModel.generateAnswer(
                data.question,
                data.context
            );
 
            const latency = Date.now() - startTime;
 
            return {
                data: response.answer,
                // Returning a value here will utilize this context for
                // evaluation instead of the CONTEXT_TO_EVALUATE column
                // (in this case, the `context` column)
                retrievedContextToEvaluate: response.retrievedContext,
                meta: {
                    usage: {
                        promptTokens: response.tokens.prompt,
                        completionTokens: response.tokens.completion,
                        totalTokens: response.tokens.total,
                        latency
                    },
                    cost: {
                        input: response.cost.prompt,
                        output: response.cost.completion,
                        total: response.cost.total
                    }
                }
            };
        })
        .withEvaluators(
            "Faithfulness",
            "Answer Relevance",
            "Human Evaluator"
        )
        .withHumanEvaluationConfig({
            emails: ["[email protected]"],
            instructions: `Please evaluate the responses for accuracy and completeness. Consider both factual correctness and answer format.`
        })
        .withConcurrency(10)
        .run(30); // 30 minutes timeout
 
    console.log("Test Run Link:", result.testRunResult.link);
    console.log("Failed Entries:", result.failedEntryIndices);
    console.log("Evaluation Results:", result.testRunResult.result[0]);
    /*
    the result.testRunResult.result[0] object looks like this (values are mock data):
    {
        cost: {
            input: 1.905419538506091,
            completion: 2.010163610111029,
            total: 3.915583148617119
        },
        latency: {
            min: 6,
            max: 484.5761906393187,
            p50: 438,
            p90: 484,
            p95: 484,
            p99: 484,
            mean: 346.2,
            standardDeviation: 179.4284,
            total: 5
        },
        name: 'sdk test run 1734931207308',
        usage: { completion: 206, input: 150, total: 356 },
        individualEvaluatorMeanScore: {
            Faithfulness: { score: 0, outOf: 1 },
            'Answer Relevance': { score: 0.2, outOf: 1 },
        }
    }
    */
} catch (error) {
    console.error("Test Run Failed:", error);
} finally {
    await maxim.cleanup();
}
