The score is computed as `distance = 1 - cosine_similarity`. Lower scores (closer to 0) indicate higher similarity; a short sketch of this computation follows the table below.

| Turn | Content | Purpose |
| --- | --- | --- |
| Initial prompt | You are a helpful AI assistant specialized in geography. | Sets the context for the interaction (optional, model-dependent) |
| User input | What's the capital of France? | The first query for the AI to respond to |
| Model response | The capital of France is Paris. | The model's response to the first query |
| User input | What's its population? | A follow-up question, building on the previous context |
| Model response | As of 2023, the estimated population of Paris is about 2.2 million people in the city proper. | The model's response to the follow-up question |
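As a quick illustration of the distance formula above, here is a minimal sketch (not Maxim's implementation) of turning the cosine similarity of two embedding vectors into a distance score. The vectors and function names are purely illustrative.

```typescript
// Minimal sketch: cosine distance between two embedding vectors,
// where a lower distance means the texts are more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}

// distance = 1 - cosine_similarity, as described above
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

// Toy 3-dimensional "embeddings" (hypothetical values)
console.log(cosineDistance([1, 0, 1], [1, 0, 1])); // 0 -> identical direction, highest similarity
console.log(cosineDistance([1, 0, 0], [0, 1, 0])); // 1 -> orthogonal vectors, no similarity
```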
| Evaluator type | Description |
| --- | --- |
| AI | Uses AI models to assess outputs |
| Programmatic | Applies predefined rules or algorithms |
| Statistical | Utilizes statistical methods for evaluation |
| Human | Involves human judgment and feedback |
| API-based | Leverages external APIs for assessment |
You can find more about our evaluators [here](/library/evaluators/pre-built-evaluators).
The retrieved context is passed to the prompt through the `{{context}}` variable, and the prompt includes instructions such as:

* Include at least one direct quote from the context, enclosed in quotation marks, and specify the section and page number where the quote can be found.
* Ensure the response is friendly and polite, adding "please" at the end to maintain a courteous tone.

At query time, the retriever populates the `{{context}}` variable, and the LLM generates a response for our input query using the information in the retrieved context.
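For illustration, the system prompt for such an assistant could look like the hypothetical template below. This is not the exact prompt used in the example; it only shows where the `{{context}}` and `{{query}}` variables fit.

```text
You are a helpful HR assistant. Answer the employee's question using only the information provided in the context below.

Context:
{{context}}

Question:
{{query}}

Include at least one direct quote from the context, enclosed in quotation marks, and specify the section and page number where the quote can be found. Ensure the response is friendly and polite, adding "please" at the end to maintain a courteous tone.
```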
To evaluate the performance of our assistant, we'll now create a test [dataset](/library/datasets/import-or-create-datasets): a collection of employee queries and their corresponding expected responses. We'll use the expected responses as references to assess the quality of the responses generated by our assistant.
### Step 2: Create a Dataset
For our example, we'll use the [HR\_queries.csv](https://docs.google.com/spreadsheets/d/1nEEWlw7BeGSahJMRk26nNTG_s6y9jJ2HIFAIB79KmkM/edit?usp=sharing) dataset.
1. To upload the dataset to Maxim, go to the "Library" section and select "Datasets"
2. Click the "+" button and upload a CSV file as a dataset
3. Map the columns in the following manner:
* Set `employee_query` as "Input" type, since these queries will be the input to our HR assistant
* Set `expected_response` as "Expected Output" type, since this is the reference for comparison of generated assistant responses
4. Click "Add to dataset" and your evaluation dataset is ready to use
### Step 3: Evaluating the HR Assistant
Now we'll evaluate the performance of our HR assistant and the quality of the generated responses.
We'll evaluate the performance using the following evaluators from Maxim's [Evaluator Store](/library/evaluators/pre-built-evaluators):
| Evaluator | Type | Purpose |
| ------------------- | -------------- | -------------------------------------------------------------------------------------------- |
| Context Relevance | LLM-as-a-judge | Evaluates how well your RAG pipeline's retriever finds information relevant to the input |
| Faithfulness | LLM-as-a-judge | Measures whether the output factually aligns with the contents of your context |
| Context Precision | LLM-as-a-judge | Measures retriever accuracy by assessing the relevance of each node in the retrieved context |
| Bias | LLM-as-a-judge | Determines whether output contains gender, racial, political, or geographical bias |
| Semantic Similarity | Statistical | Checks whether the generated output is semantically similar to the expected output |
| Tone check          | Custom eval    | Determines whether the output has a friendly and polite tone                                  |
**Tone check**: To check the tone of our HR assistant's responses, we'll also create a custom LLM-as-a-Judge evaluator on Maxim. We'll define the following instructions for our judge LLM to evaluate the tone:
For the given output `{{output}}`, determine if the response is friendly and polite for the input query `{{query}}`.
* `createTestRun` is the main function that creates a test run. It takes the name of the test run and the workspace ID.
* `withDataStructure` defines the data structure of the dataset. It takes an object whose keys are the column names and whose values are the column types.
* `withData` specifies the dataset to use for the test run. It can be a dataset ID (string), a CSV file, or an array of column-to-value mappings.
* `withEvaluators` specifies the evaluators to attach to the test run. You can create an evaluator locally in code or reference an evaluator installed in your workspace by name.
* `withPromptVersionId` specifies the prompt version to use for the test run. It takes the ID of the prompt version.
* `run` executes the test run, as shown in the sketch below.
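Putting these together, a test run created with the SDK might look roughly like the following sketch. The package import, environment variable names, column-type strings, evaluator names, and IDs are assumptions for illustration; check the SDK reference for the exact signatures and values.

```typescript
import { Maxim } from "@maximai/maxim-js"; // package name assumed

const maxim = new Maxim({ apiKey: process.env.MAXIM_API_KEY! });

const result = await maxim
  // Name of the test run and the workspace it belongs to
  .createTestRun("HR assistant evaluation", process.env.MAXIM_WORKSPACE_ID!)
  // Column names mapped to column types (type strings assumed here)
  .withDataStructure({
    employee_query: "INPUT",
    expected_response: "EXPECTED_OUTPUT",
  })
  // A dataset ID from the platform; a CSV file or an array of rows also works
  .withData("your-dataset-id")
  // Evaluators installed in the workspace, referenced by name
  .withEvaluators("Faithfulness", "Semantic Similarity", "Tone check")
  // The prompt version to run against the dataset
  .withPromptVersionId("your-prompt-version-id")
  .run();

console.log(result); // inspect the run, e.g. to find the report on the platform
```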
## Next Steps
Now that you've run your first prompt evaluation, explore these guides:
* [Local Prompt Testing](/offline-evals/via-sdk/prompts/local-prompt) - Learn how to test prompts with custom logic
* [Maxim Prompt Testing](/offline-evals/via-sdk/prompts/maxim-prompt) - Use prompts stored on the Maxim platform
* [Prompt Management](/offline-evals/via-sdk/prompts/prompt-management) - Retrieve and use prompts in production workflows
* [CI/CD Integration](/offline-evals/via-sdk/prompts/ci-cd-integration) - Automate prompt testing in your CI/CD pipeline
# Customized Reports
Source: https://www.getmaxim.ai/docs/offline-evals/via-ui/advanced/customized-reports
The run report is a single source of truth for you to understand exactly how your AI system is performing during your experiments or pre-release testing. You can customize reports to gain insights and make decisions.
## Toggle columns
For prompt or workflow runs, by default we only show the input from the dataset, the retrieved context (if applicable), and the output from the run. However, there might be cases where you want to see other dataset columns to analyze the output, or hide some of the visible columns to focus on less data while analyzing evaluations. To show or hide columns, follow the steps below:
Add Composio as a client: go to "Settings" → "MCP Clients" and click "Add new client". Then open "Prompts". You will find all the tools you have set up under "Prompt Tools".
* Evaluate captured logs automatically from the UI based on filters and sampling.
* Use human evaluation or ratings to assess the quality of your logs.
* Evaluate any component of your trace or log to gain insights into your agent's behavior.
Use `maxim.getVaultVariable` in the scripts of your agents, API evaluators, etc. The video below shows how to use the vault variables in different ways on the platform.
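As a hypothetical sketch of how this might look inside an agent or API evaluator script, the snippet below reads a secret from the vault and uses it in an outgoing request. The availability of a `maxim` object in the script context and the exact signature of `getVaultVariable` are assumptions; consult the platform documentation for specifics.

```typescript
// Assumption: the script runtime exposes a `maxim` object with getVaultVariable.
declare const maxim: { getVaultVariable(name: string): string };

// Read a secret stored in the vault (variable name is hypothetical).
const slackWebhookUrl = maxim.getVaultVariable("SLACK_WEBHOOK_URL");

// Use it like any other string, e.g. to notify an external service.
await fetch(slackWebhookUrl, {
  method: "POST",
  body: JSON.stringify({ text: "Evaluation completed" }),
});
```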