
Automate Prompt evaluation via CI/CD

Trigger test runs in CI/CD pipelines to evaluate prompts automatically.

Ensuring quality with every deployment

AI applications ship fast today, with frequent iterations and changes to the system. Moving that fast often means breaking things, but it doesn't have to. A good practice is to make sure that existing functionality keeps working as new changes land.

This is where Maxim's CI/CD integration helps. By adding a step that runs a test run to your deployment workflow, you can ensure that every new deployment meets a baseline level of quality.

Before we start

Triggering a test run from CI/CD requires you to have the following set up and ready:

  1. An API key from Maxim
  2. A prompt version to test upon
  3. A dataset to test against
  4. Evaluators to evaluate the prompt version against the dataset

With these prerequisites in place, you can trigger test runs in either of the following ways:

Test runs via CLI

Maxim offers a CLI tool for triggering test runs. It ships as an executable binary that you run from the command line.

Installation

Use the following command template to install the CLI tool (on Windows, substitute the corresponding Windows OS and ARCH values from the table below):

wget https://downloads.getmaxim.ai/cli/<VERSION>/<OS>/<ARCH>/maxim

The CLI tool is currently only available as v1, so replace <VERSION> with v1.
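
For example, on a Linux amd64 machine the installation might look like the sketch below. The URL is the template above with <VERSION>=v1, <OS>=Linux, and <ARCH>=amd64 substituted (adjust the values, and their casing if a download fails, for your machine); the chmod and mv steps are general shell conveniences for putting the binary on your PATH, not Maxim-specific requirements.

Example installation on Linux amd64
# Download the v1 CLI binary (values substituted from the tables below)
wget https://downloads.getmaxim.ai/cli/v1/Linux/amd64/maxim

# Make the binary executable and, optionally, move it onto your PATH
chmod +x maxim
sudo mv maxim /usr/local/bin/maxim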

Supported OS + ARCH

The command template accepts the following OS and ARCH values:

| OS | ARCH |
|---|---|
| Linux | amd64 |
| Linux | 386 |
| Darwin | arm64 |
| Darwin | 386 |
| Windows | amd64 |
| Windows | 386 |

Env Variables

Set the following environment variables before using the CLI; these values cannot be passed as arguments or flags.

| Name | Value |
|---|---|
| MAXIM_API_KEY | API key obtained via Maxim |
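
For example, in a local shell or a CI step you would export the key before invoking the CLI (the value below is a placeholder):

Setting the required environment variable
# The CLI reads the API key from the environment; it cannot be passed as a flag
export MAXIM_API_KEY="<your_maxim_api_key>"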

Triggering a test run

Use this template to trigger a test run:

Shell command to trigger a test run
# If you haven't added the binary to your PATH,
# replace `maxim` with the path to the binary you just downloaded
maxim test -p <prompt_version_id> -d <dataset_id> -e <comma_separated_evaluator_names>

Here are the arguments and flags you can pass to the CLI to configure your test run:

| Argument / Flag | Description |
|---|---|
| -p | Prompt version ID or IDs; if you pass multiple IDs (comma separated), a comparison run is created. |
| -d | Dataset ID |
| -e | Comma separated evaluator names, e.g. bias,clarity |
| --json | (optional) Output the result in JSON format |
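
For example, a deployment pipeline could run the CLI with --json and keep the result as a build artifact. The sketch below assumes the binary is on your PATH, that MAXIM_API_KEY is already exported, and that the CLI exits with a non-zero status when the run fails; the IDs and evaluator names are placeholders.

Example CI step using the CLI
#!/usr/bin/env sh
set -e  # stop the pipeline if the CLI exits non-zero (assumed failure behavior)

# Compare two prompt versions against the same dataset and capture the JSON result
maxim test \
  -p <prompt_version_id_1>,<prompt_version_id_2> \
  -d <dataset_id> \
  -e bias,clarity \
  --json > test_run_result.json

echo "Maxim test run finished; result saved to test_run_result.json"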

Test runs via GitHub Action

GitHub Actions provides a powerful way to test, build, and deploy your application. Our GitHub Action integrates seamlessly into your existing deployment workflows, letting you verify that your LLM behaves as you expect.

Quick Start

To add the GitHub Action to your workflow, start by adding a step that uses maximhq/actions/test-runs@v1 as follows:

Please ensure that you have the following set up (a sample setup using the GitHub CLI is shown right after this list):

  • In GitHub Action secrets:
    • MAXIM_API_KEY
  • In GitHub Action variables:
    • WORKSPACE_ID
    • DATASET_ID
    • PROMPT_VERSION_ID
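
One way to set these values, assuming you manage the repository with the GitHub CLI (gh), is shown below; the values are placeholders:

Setting the secret and variables with the GitHub CLI
# Store the Maxim API key as an encrypted Actions secret
gh secret set MAXIM_API_KEY --body "<your_maxim_api_key>"

# Store the IDs as repository-level Actions variables
gh variable set WORKSPACE_ID --body "<workspace_id>"
gh variable set DATASET_ID --body "<dataset_id>"
gh variable set PROMPT_VERSION_ID --body "<prompt_version_id>"
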
.github/workflows/test-runs.yml
name: Run Test Runs with Maxim
 
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
 
env:
  TEST_RUN_NAME: "Test Run via GitHub Action"
  CONTEXT_TO_EVALUATE: "context"
  EVALUATORS: "bias, clarity, faithfulness"
 
jobs:
  test_run:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2
      - name: Running Test Run
        id: test_run
        uses: maximhq/actions/test-runs@v1
        with:
          api_key: ${{ secrets.MAXIM_API_KEY }}
          workspace_id: ${{ vars.WORKSPACE_ID }}
          test_run_name: ${{ env.TEST_RUN_NAME }}
          dataset_id: ${{ vars.DATASET_ID }}
          prompt_version_id: ${{ vars.PROMPT_VERSION_ID }}
          context_to_evaluate: ${{ env.CONTEXT_TO_EVALUATE }}
          evaluators: ${{ env.EVALUATORS }}
      - name: Display Test Run Results
        if: success()
        run: |
          printf '%s\n' '${{ steps.test_run.outputs.test_run_result }}'
          printf '%s\n' '${{ steps.test_run.outputs.test_run_failed_indices }}'
          echo 'Test Run Report URL: ${{ steps.test_run.outputs.test_run_report_url }}'

This triggers a test run on the platform and waits for it to complete before proceeding. The progress of the test run is shown under the Running Test Run step in the GitHub Action's logs, as displayed below:

GitHub Action Running Test Run Logs

Inputs

The following are the inputs that can be used to configure the GitHub Action:

| Name | Description | Required |
|---|---|---|
| api_key | Maxim API key | Yes |
| workspace_id | Workspace ID to run the test run in | Yes |
| test_run_name | Name of the test run | Yes |
| dataset_id | Dataset ID for the test run | Yes |
| prompt_version_id | Prompt version ID to run for the test run (do not use with workflow_id) | Yes (No if workflow_id is provided) |
| workflow_id | Workflow ID to run for the test run (discussed in Evaluate Workflows via API -> Automate workflow evaluation via CI/CD; do not use with prompt_version_id) | Yes (No if prompt_version_id is provided) |
| context_to_evaluate | Variable name to evaluate; can be any variable used in the workflow / prompt, or a column name | No |
| evaluators | Comma separated list of evaluator names | No |
| human_evaluation_emails | Comma separated list of emails to send human evaluations to | No (required if evaluators includes a human evaluator) |
| human_evaluation_instructions | Overall instructions for human evaluators | No |
| concurrency | Maximum number of concurrent test run entries running | No (defaults to 10) |
| timeout_in_minutes | Fail if the overall test run takes longer than this many minutes | No (defaults to 15 minutes) |

Outputs

If the Action does not fail, it provides the following outputs:

| Name | Description |
|---|---|
| test_run_result | Result of the test run |
| test_run_report_url | URL of the test run report |
| test_run_failed_indices | Indices of failed test run entries |
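
For example, a follow-up step in the workflow above could fail the job (and thereby block a later deploy job) when any entries failed. The sketch below is the shell body of a run: step; the exact shape of test_run_failed_indices (assumed here to be empty or "[]" when nothing failed) is an assumption, so check the real output before relying on it.

Example run: step body that gates on failed entries
# Read the outputs of the earlier test_run step (see the workflow above)
FAILED_INDICES='${{ steps.test_run.outputs.test_run_failed_indices }}'

# Fail the job if any test run entries failed (assumed: empty or "[]" means none failed)
if [ -n "$FAILED_INDICES" ] && [ "$FAILED_INDICES" != "[]" ]; then
  echo "Blocking deployment: failed test run entries: $FAILED_INDICES"
  exit 1
fi
echo "All test run entries passed; report: ${{ steps.test_run.outputs.test_run_report_url }}"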

Evaluating Workflows

Please refer to Evaluate Workflows via API -> Automate workflow evaluation via CI/CD
