How to Evaluate Workflows via API

Automate workflow evaluation via CI/CD

Trigger test runs in CI/CD pipelines to evaluate workflows automatically.

This guide builds upon Evaluate Prompts -> Automate prompt evaluation via CI/CD. Please refer to it first if you haven't already.

Pre-requisites

Apart from the pre-requisites mentioned in Evaluate Prompts -> Automate prompt evaluation via CI/CD, you also need a workflow to test upon.

In full, you need:

  1. An API key from Maxim
  2. A dataset to test against
  3. Evaluators to evaluate the workflow against the dataset
  4. A workflow to test upon

Test runs via CLI

Apart from what was introduced earlier, you can use the -w flag in place of -p to specify the workflow to test upon.

Installation

Use the following command template to install the CLI tool (if you are using Windows, please refer to the Windows example as well):

wget https://downloads.getmaxim.ai/cli/<VERSION>/<OS>/<ARCH>/maxim
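
For example, assuming the OS and architecture path segments for a 64-bit Linux machine are linux and amd64 (check the downloads page for the exact values for your platform), installation might look like the sketch below; the chmod and mv steps are simply the usual way of putting a downloaded binary on your PATH.

Example installation on 64-bit Linux
# Hypothetical platform values; substitute <VERSION> with the CLI version you want
wget https://downloads.getmaxim.ai/cli/<VERSION>/linux/amd64/maxim

# Make the binary executable and (optionally) move it onto your PATH
chmod +x maxim
sudo mv maxim /usr/local/bin/maxim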

For more, please refer to Evaluate Prompts -> Automate prompt evaluation via CI/CD -> Test runs via CLI.

Triggering a test run

Use this template to trigger a test run:

Shell command to trigger a test run
# If you haven't added the binary to your PATH,
# replace `maxim` with the path to the binary you just downloaded
maxim test -w <workflow_id> -d <dataset_id> -e <comma_separated_evaluator_names>

Here are the arguments and flags you can pass to the CLI to configure your test run (a complete example invocation follows the table):

| Argument / Flag | Description |
| --- | --- |
| -w | Workflow ID or IDs; if you pass multiple IDs (comma separated), a comparison run is created. |
| -d | Dataset ID |
| -e | Comma separated evaluator names, e.g. bias,clarity |
| --json | (optional) Output the result in JSON format |
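
Putting it together, a complete invocation might look like the sketch below. The IDs are placeholders for illustration only; passing two comma-separated workflow IDs to -w creates a comparison run, and --json prints the result as JSON so it can be parsed by downstream tooling.

Example test run invocation
# Hypothetical IDs shown for illustration only
maxim test -w wf_123,wf_456 -d ds_789 -e bias,clarity --json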

Test runs via GitHub Action

Apart from what was introduced earlier, you can use the workflow_id parameter (under with) in place of prompt_version_id to specify the workflow to test upon.

Quick Start

To add the GitHub Action to your workflow, start by adding a step that uses maximhq/actions/test-runs@v1 as follows:

Please ensure that you have the following set up:

  • In GitHub Action secrets:
    • MAXIM_API_KEY
  • In GitHub Action variables:
    • WORKSPACE_ID
    • DATASET_ID
    • WORKFLOW_ID
.github/workflows/test-runs.yml
name: Run Test Runs with Maxim
 
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
 
env:
  TEST_RUN_NAME: "Test Run via GitHub Action"
  CONTEXT_TO_EVALUATE: "context"
  EVALUATORS: "bias, clarity, faithfulness"
 
jobs:
  test_run:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2
      - name: Running Test Run
        id: test_run
        uses: maximhq/actions/test-runs@v1
        with:
          api_key: ${{ secrets.MAXIM_API_KEY }}
          workspace_id: ${{ vars.WORKSPACE_ID }}
          test_run_name: ${{ env.TEST_RUN_NAME }}
          dataset_id: ${{ vars.DATASET_ID }}
          workflow_id: ${{ vars.WORKFLOW_ID }}
          context_to_evaluate: ${{ env.CONTEXT_TO_EVALUATE }}
          evaluators: ${{ env.EVALUATORS }}
      - name: Display Test Run Results
        if: success()
        run: |
          printf '%s\n' '${{ steps.test_run.outputs.test_run_result }}'
          printf '%s\n' '${{ steps.test_run.outputs.test_run_failed_indices }}'
          echo 'Test Run Report URL: ${{ steps.test_run.outputs.test_run_report_url }}'

This will trigger a test run on the platform and wait for it to complete before proceeding. The progress of the test run appears under the Running Test Run step in the GitHub Action's logs, as shown below:

[Screenshot: GitHub Action logs for the Running Test Run step]

Inputs

The following are the inputs that can be used to configure the GitHub Action:

| Name | Description | Required |
| --- | --- | --- |
| api_key | Maxim API key | Yes |
| workspace_id | Workspace ID to run the test run in | Yes |
| test_run_name | Name of the test run | Yes |
| dataset_id | Dataset ID for the test run | Yes |
| workflow_id | Workflow ID to run for the test run (do not use with prompt_version_id) | Yes (No if prompt_version_id is provided) |
| prompt_version_id | Prompt version ID to run for the test run (discussed in Evaluate Prompts -> Automate prompt evaluation via CI/CD; do not use with workflow_id) | Yes (No if workflow_id is provided) |
| context_to_evaluate | Variable name to evaluate; can be any variable used in the workflow / prompt or a column name | No |
| evaluators | Comma separated list of evaluator names | No |
| human_evaluation_emails | Comma separated list of emails to send human evaluations to | No (required if evaluators includes a human evaluator) |
| human_evaluation_instructions | Overall instructions for human evaluators | No |
| concurrency | Maximum number of concurrent test run entries | No (defaults to 10) |
| timeout_in_minutes | Fail if the overall test run takes longer than this many minutes | No (defaults to 15 minutes) |

Outputs

If the GitHub Action succeeds, it provides the following outputs (a sketch of consuming them in a follow-up step appears after the table):

| Name | Description |
| --- | --- |
| test_run_result | Result of the test run |
| test_run_report_url | URL of the test run report |
| test_run_failed_indices | Indices of failed test run entries |
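
As a sketch, a follow-up step in the same job could surface the report URL and fail the build when any entries failed. The shell below is meant for a run: block and assumes REPORT_URL and FAILED are mapped from the step outputs via env (for example FAILED: ${{ steps.test_run.outputs.test_run_failed_indices }}); it also assumes an empty value or empty list means no failures, so check the action's actual output format before relying on it.

Example: failing the job on failed test run entries
# REPORT_URL and FAILED are assumed to be set via the step's env: mapping
echo "Test Run Report URL: $REPORT_URL"
if [ -n "$FAILED" ] && [ "$FAILED" != "[]" ]; then
  echo "Some test run entries failed: $FAILED"
  exit 1
fi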

Evaluating Prompts

Please refer to Evaluate Prompts -> Automate prompt evaluation via CI/CD
