Create consistent evaluation workflows for AI agents - IBM Developer

Tutorial

Create consistent evaluation workflows for AI agents

A hands-on guide to setting up and using a structured evaluation strategy to test, benchmark, and improve AI agent performance

By

Goura Mohanty

When teams build AI agents, it is easy to get excited about making the agents run tasks, respond to prompts, or connect tools. However, success is not only about getting the agent to act; it is about ensuring the agent acts correctly every time.

That is where AI agent evaluation plays a critical role.

Evaluation checks how well an AI agent performs in real-world conditions, with real users, real data, and unpredictable situations. Evaluation helps answer important questions such as:

  • Is the AI agent responding accurately?
  • Is the AI agent choosing the correct tools and steps?
  • Are the AI agent’s outputs reliable and useful?

With proper evaluation, teams shift from "it works for now" to "it works when it matters." Evaluation helps identify gaps, measure performance, and improve quality over time. Most importantly, evaluation builds trust. Evaluation turns an early prototype into a dependable assistant.

Evaluation is not the final step. Evaluation is an ongoing commitment to ensure that every AI decision and every AI action adds meaningful value.

Agent predeployment lifecycle

The AI agent development process is iterative and follows four key stages:

Predeployment lifecycle diagram

  • Develop: Build and configure the AI agent.
  • Evaluate: Test the AI agent and measure its performance against reference data.
  • Analyze: Review the AI agent’s behavior and identify issues or patterns.
  • Improve: Refine the AI agent based on the findings from the evaluation.

Evaluation framework in watsonx Orchestrate

The evaluation framework acts like a coach for the AI agent. After the AI agent starts interacting and making decisions, the evaluation framework examines how well the AI agent performs. It compares the AI agent’s simulated actions, called trajectories, with a set of reference examples that represent strong and reliable performance.

This comparison helps development teams identify where the AI agent performs well and where the AI agent requires improvement. The evaluation framework also provides tools to create and manage reference data, making it easier to refine the AI agent throughout the development process. The evaluation framework guides the AI agent from an early prototype to a consistent and dependable performer.

How the evaluation process works

The evaluation framework follows these steps:

  • User story creation: Write clear user stories that describe the goal, the context, and the AI agent involved. Each user story defines what a successful interaction should look like.
  • Simulation: Create test cases for each user story to prepare the evaluation framework for execution.
  • User agent interaction: The user agent, which is powered by a large language model (LLM), sends messages to the target agent by using the context provided in the user story.
  • Response comparison: Compare the target agent’s response with the expected response defined in the user story.
  • Success criteria check: If the responses match, the user story passes. If the responses do not match, the user story fails.

In this tutorial, you learn how to set up an evaluation framework to test and benchmark your AI agents for a banking use case. This tutorial covers:

  • Evaluating AI agents and tools by using generated user stories or by recording user interactions.
  • Testing the AI agent in the watsonx Orchestrate platform.

Prerequisites

  • You must have a local environment of the IBM watsonx Orchestrate Agent Development Kit (ADK) running. If not, follow the Getting Started with ADK tutorial.
  • This tutorial has been tested on watsonx Orchestrate ADK versions 1.13.0 and 1.14. Use version 1.12.0 or later for SaaS and on-premises support.
  • You must have Python version 3.10 or later installed.
  • Ensure that the ADK CLI is installed (pip install watsonx-orchestrate-adk).
  • If you are using the SaaS environment, ensure that you are authenticated with the IBM Cloud CLI.
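
Before you begin, you can run a quick sanity check from a terminal. This is a minimal sketch; the install command is the one listed above, and the exact CLI output varies by ADK version:

    # Confirm the Python version (3.10 or later is required)
    python3 --version

    # Install the ADK CLI if it is not already installed
    pip install watsonx-orchestrate-adk

    # Confirm that the orchestrate CLI is available on your PATH
    orchestrate --help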

Step 1. Set up the ADK environment

  1. Create an IBM Cloud API key.
  2. Clone the GitHub repository and go to the wxo-evaluation-framework folder.

    git clone https://github.com/IBM/oic-i-agentic-ai-tutorials/

    cd oic-i-agentic-ai-tutorials/wxo-evaluation-framework/
    tree .

    Check resources

    This folder is your working directory. Open the folder in Visual Studio Code or in any editor that you prefer.
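
    The evaluation commands later in this tutorial read credentials from a local .env file (passed with --env-file .env). The exact variable names depend on your ADK version and target environment, so treat the following as a placeholder sketch and confirm the required keys in the watsonx Orchestrate ADK documentation:

    # Placeholder .env sketch -- the variable names below are assumptions,
    # not an authoritative list; check the ADK documentation for the keys
    # that your ADK version and environment require.
    WATSONX_APIKEY=<your IBM Cloud API key from this step>
    WATSONX_SPACE_ID=<your watsonx.ai space ID>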

Step 2. Import tools and agents

Use the watsonx Orchestrate ADK to import the AI agents and tools from the cloned repository into watsonx Orchestrate.
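
The import-all.sh script used in the next step wraps the individual ADK import commands. Conceptually, it runs commands along the following lines; the file names here are illustrative, and the script in the repository is the source of truth:

    # Illustrative sketch only -- the actual file names are defined in import-all.sh.
    orchestrate tools import -k python -f ./agent_tools/<tool_file>.py
    orchestrate agents import -f ./agents/<agent_definition>.yaml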

  1. Import all tools and agents.

    chmod +x import-all.sh
    ./import-all.sh
  2. Confirm that the tools and agents were imported correctly. You can also verify the imported resources in the watsonx Orchestrate user interface (UI).

    orchestrate tools list
    orchestrate agents list

    If you are using the IBM SaaS environment:

    • If you do not see the AI agents, make sure that you are viewing the correct watsonx Orchestrate instance.
    • If the AI agents still do not appear, the import operation in the preceding step might have failed.

    Check resources

  3. Use sample queries to verify that the AI agent responds correctly:

    • My username is Alice. I want to find my current account balance.
    • I want to list my last five transactions for my account.
    • I want to update my registered email address to alex.new@abc.com
    • I want to know whether I have sufficient balance to transfer 5000 USD.

      Test agent in wxo preview mode

Step 3. Create an evaluation dataset and run the evaluation

Create an evaluation dataset by choosing one of the following methods:

  • The generate command method: Use this method when the AI agent’s expected behavior is already known or clearly defined by business rules.

  • The record command method: Use this method to capture live interactions for evaluation.

The generate command method

This method automatically creates ground truth evaluation datasets (test cases) from user stories and tool definitions. It helps you build realistic and repeatable test cases for evaluating and benchmarking AI agents.

Flow: Create user stories → Generate the evaluation dataset → Run the evaluation on the evaluation dataset.
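
As described in the next step, each row of the user-story CSV pairs a story with the agent that should handle it. A minimal example might look like the following; the agent name banking_agent is illustrative, so substitute the name of the agent that you imported:

    story,agent
    "My username is Alice. I want to find my current account balance.",banking_agent
    "I want to know whether I have sufficient balance to transfer 5000 USD.",banking_agent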

  1. Create or view user stories.

    a. In the current working directory wxo-evaluation-framework, open the file user_stories/banking_user_stories.csv.

    b. This CSV file contains the user stories to be tested. The file includes the following columns:

    • story: The user story.
    • agent: The AI agent responsible for running the user story.

      User stories

      You can edit this file and add more user stories as required.

      Best practices for writing user stories:

      • Keep user stories realistic and context-based.
      • Clearly state the goal and include all required inputs for the tools.
      • Include both successful and fallback cases when applicable.
      • Cover all tools and tool parameters.
      • Include positive and negative scenarios.
      • Ensure that the AI agent name is correct and has no extra spaces.
  2. Generate test cases from user stories.

    This step runs each user story end-to-end, follows the defined tool flow, and creates the expected outputs. All generated test cases are saved in the folder user_stories/banking_agent_test_cases/.

    a. Run the following command in the terminal.

    orchestrate evaluations generate --stories-path ./user_stories/banking_user_stories.csv --tools-path ./agent_tools/ --env-file .env

    Create test cases

    b. After the generation process completes, two test cases are created for each user story.

    View test cases

    All generated test cases are saved in the user_stories/banking_agent_test_cases/ folder. The test cases are stored in JSON format and will be used in the next evaluation step.

  3. Run the following command to start the evaluation of all test cases:

    orchestrate evaluations evaluate --test-paths ./user_stories/banking_agent_test_cases/ --output-dir ./user_stories/test_execution/ --env-file .env

    Evaluate test cases

    During the evaluation, the evaluation framework simulates a real conversation between a user and the target AI agent. A user agent, which is powered by a large language model (LLM), reads the user story, interprets the context and intent, and interacts with the target AI agent as a user would. This process helps you to verify whether the target AI agent selects the correct tools and produces the expected response.

  4. After the evaluation completes, you can view the final evaluation report.

    Final output evaluation

    The metrics table shows several values for each user story, including:

    • Total steps taken
    • LLM steps
    • Total tool calls
    • Tool call precision
    • Tool call recall
    • Agent routing accuracy
    • Text match
    • Journey success (0% or 100%)
    • Average response time

    Use the Journey success value to identify which tools or AI agents require improvement. If a user story fails, review the associated tool or script and update it as needed.

    You can view all evaluation metrics, simulated user interactions, and detailed analysis in the folder user_stories/test_execution/.

    Inspect output evaluation

  5. Analyze the test results.

    a. Run the following command:

    orchestrate evaluations analyze -d <test_execution_path> --env-file .env

    Note: test_execution_path refers to the folder where the test results were generated in the previous step.

    For example: ./user_stories/test_execution/<execution_timestamp>

    Inspect output evaluation

    You can review each test case in detail. Use the analysis output to identify issues in AI agent behavior and plan improvements.

    b. To inspect the tools and test cases in a more complete and structured way, run the following command:

    orchestrate evaluations analyze --mode enhanced -d <test_execution_path> --env-file .env

    Analyze test results enhanced

    View test results enhanced

    The enhanced report provides a clear summary whenever a tool call fails.

  6. Go to Step 4. Test the agent in the watsonx Orchestrate platform.
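
For reference, the generate command method boils down to three commands run in sequence from the working directory (the same commands used in the preceding steps):

    orchestrate evaluations generate --stories-path ./user_stories/banking_user_stories.csv --tools-path ./agent_tools/ --env-file .env
    orchestrate evaluations evaluate --test-paths ./user_stories/banking_agent_test_cases/ --output-dir ./user_stories/test_execution/ --env-file .env
    orchestrate evaluations analyze -d ./user_stories/test_execution/<execution_timestamp> --env-file .env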

The record command method

This method captures live chat sessions from the chat interface and converts the sessions into structured datasets. Each chat session is recorded in real time, and the data is saved in a separate annotated file for later analysis. You can use these datasets to compare AI agent behavior across different configuration setups, such as model types, agent descriptions, or parameter settings.

Flow: Record the user interaction → Run the evaluation by using the recorded dataset

  1. Record the interaction.

    a. Run the following command to start recording the interactions:

    orchestrate evaluations record --output-dir ./chat_recordings --env-file .env

    Start chat recordings

    b. Recording is now active. All recorded interactions are saved in the chat_recordings folder.

    View chat recordings

    c. Test the AI agent in the watsonx Orchestrate user interface.

    Use the following sample questions. These interactions will be saved in the chat_recordings folder after you stop the recording.

    • My username is Alice. I want to find my current account balance.
    • I want to list my last five transactions for my account.
    • I want to update my registered email address to alex.new@abc.com
    • I want to know whether I have sufficient balance to transfer 5000 USD.

      Test agent

  2. Stop the recording.

    a. When you complete your session, press Ctrl+C in the terminal where the record command is running. Make sure that the conversation is fully complete before stopping the recording so the dataset is not cut off.

    Stop chat recording

    b. You can view all chat recordings in the chat_recordings folder.

    View chat recording

    Note:

    • The annotated data is generated automatically, so review the data before using it for evaluation. Remove any details that do not apply to your tests.
    • The starting_sentence field is derived directly from your inputs, but fields such as story and goals are generated from the recorded conversation and might need validation.
  3. Generate the evaluation report by using the recorded chat dataset.

    a. Run the following command in the terminal:

    orchestrate evaluations evaluate --test-paths <path_for_chat_recording_file> --output-dir <output_directory> --env-file .env

    Example:

    • <path_for_chat_recording_file> = ./chat_recordings/c491d30e-96f5-45a1-a428-8bc8b86d59d3
    • <output_directory> = ./user_stories/test_execution_2

      Evaluate chat recordings

      b. You can view the evaluation report in the output directory path that you specified during the command execution.

      View evaluate recordings

  4. Follow step 5 of the generate command method (Analyze the test results) to review the test results and understand the behavior of the evaluation framework.

Step 4. Test the agent in the watsonx Orchestrate platform

Use the built-in testing features for agents in watsonx Orchestrate. Create a simple test file with two columns:

  • Prompt: user input or query.
  • Answer: expected response.

For this tutorial, use the sample file banking_test_cases.csv, which is located at wxo-evaluation-framework/user_stories/banking_test_cases.csv.
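
For reference, the file follows a simple two-column layout. The rows below are an illustration only; the expected answers are placeholders, so rely on the sample file in the repository for the actual values:

    Prompt,Answer
    "My username is Alice. I want to find my current account balance.","<expected balance response>"
    "I want to list my last five transactions for my account.","<expected transaction list>"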

View file testing

  1. Go to the IBM watsonx Orchestrate platform.

    Testing

  2. Open the Manage Agent page and select the specific agent that you want to test.

    Testing

  3. In the upper-right of the page, open the options menu and select Test.

    Testing option

  4. On the Manage test cases and evaluations page, select Upload tests.

    Manage test cases

  5. Download the banking_test_cases.csv file and upload the file to the platform.

    Upload test cases

  6. When the test cases appear, select Run to start the evaluation.

    Run test cases

  7. When the evaluation completes and the status shows Completed, select the completed entry to view the evaluation metrics.

    View test cases evaluation

  8. View the evaluation metrics.

    • Answer quality metrics include Faithfulness, Relevance, and Correctness.
    • Tool quality metrics include Accuracy and Relevance.
    • Message completion metrics indicate Success or Failure.

    View test cases report

    For detailed information about these metrics, refer to the documentation Evaluating agents and tools. You can download the evaluation report by clicking the Download button.

Summary

This tutorial explained how to set up an evaluation framework to assess tools and agent behavior. The evaluation framework establishes a clear workflow and ensures consistent testing across the agent development lifecycle.

By implementing evaluation early in the agent development lifecycle, you create a data-driven feedback loop that improves agent performance and identifies opportunities to optimize tool behavior. Using the same evaluation steps and metrics helps standardize the development and testing process.

This approach provides actionable insights, supports consistent quality checks, and improves the overall robustness of AI agents before deployment.

Acknowledgments

This tutorial was produced as part of the IBM Open Innovation Community initiative: Agentic AI (AI for Developers and Ecosystem).

The authors deeply appreciate the support of Ela Dixit, Monisankar Das, Ahmed Azraq, Moises Dominguez Garcia, and Bindu Umesh for reviewing and contributing to this tutorial.