Create consistent evaluation workflows for AI agents
When teams build AI agents, it is easy to get excited about making the agents run tasks, respond to prompts, or connect tools. However, success is not only about getting the agent to act; it is about ensuring that the agent acts correctly every time.
Evaluation checks how well an AI agent performs in real-world conditions, with real users, real data, and unpredictable situations. Evaluation helps answer important questions such as:
Is the AI agent responding accurately?
Is the AI agent choosing the correct tools and steps?
Are the AI agent’s outputs reliable and useful?
With proper evaluation, teams shift from “it works for now” to “it works when it matters.” Evaluation helps identify gaps, measure performance, and improve quality over time. Most importantly, evaluation builds trust. Evaluation turns an early prototype into a dependable assistant.
Evaluation is not the final step. Evaluation is an ongoing commitment to ensure that every AI decision and every AI action adds meaningful value.
Agent predeployment lifecycle
The AI agent development process is iterative and follows four key stages:
Develop: Build and configure the AI agent.
Evaluate: Test the AI agent and measure its performance against reference data.
Analyze: Review the AI agent’s behavior and identify issues or patterns.
Improve: Refine the AI agent based on the findings from the evaluation.
Evaluation framework in watsonx Orchestrate
The evaluation framework acts like a coach for the AI agent. After the AI agent starts interacting and making decisions, the evaluation framework examines how well the AI agent performs. It compares the AI agent’s simulated actions, called trajectories, with a set of reference examples that represent strong and reliable performance.
This comparison helps development teams identify where the AI agent performs well and where the AI agent requires improvement. The evaluation framework also provides tools to create and manage reference data, making it easier to refine the AI agent throughout the development process. The evaluation framework guides the AI agent from an early prototype to a consistent and dependable performer.
How the evaluation process works
The evaluation framework follows these steps:
User story creation: Write clear user stories that describe the goal, the context, and the AI agent involved. Each user story defines what a successful interaction should look like.
Simulation: Create test cases for each user story to prepare the evaluation framework for execution.
User agent interaction: The user agent, which is powered by a large language model (LLM), sends messages to the target agent by using the context provided in the user story.
Response comparison: Compare the target agent’s response with the expected response defined in the user story.
Success criteria check: If the responses match, the user story passes. If the responses do not match, the user story fails.
In this tutorial, you learn how to set up an evaluation framework to test and benchmark your AI agents for a banking use case. This tutorial covers:
Evaluating AI agents and tools by using generated user stories or by recording user interactions.
Testing the AI agent in the watsonx Orchestrate platform.
Step 1. Clone the repository
Clone the GitHub repository and go to the wxo-evaluation-framework folder.
git clone https://github.com/IBM/oic-i-agentic-ai-tutorials/
cd oic-i-agentic-ai-tutorials/wxo-evaluation-framework/
tree .
This folder is your working directory. Open the folder in Visual Studio Code or in any editor that you prefer.
Step 2. Import tools and agents
Use the watsonx Orchestrate Agent Development Kit (ADK) to import the AI agents and the tools into watsonx Orchestrate.
Import all tools and agents.
chmod +x import-all.sh
./import-all.sh
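The contents of import-all.sh are not reproduced in this tutorial, but helper scripts of this kind typically loop over the repository's tool and agent definitions and call the ADK import commands, roughly as follows (the file paths are placeholders, not the actual files in the repository):

# Import a Python-based tool definition
orchestrate tools import -k python -f tools/<tool_file>.py
# Import an agent definition written in YAML
orchestrate agents import -f agents/<agent_file>.yaml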
Confirm that the tools and agents were imported correctly. You can also verify the imported resources in the watsonx Orchestrate user interface (UI).
orchestrate tools list
orchestrate agents list
If you are using the IBM SaaS environment:
If you do not see the AI agents, make sure that you are viewing the correct watsonx Orchestrate instance.
If the AI agents still do not appear, the import operation in the preceding step might have failed.
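To check which watsonx Orchestrate environment the ADK is currently targeting, you can list and switch environments from the command line (the environment name is specific to your setup):

orchestrate env list
orchestrate env activate <environment-name>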
Use sample queries to verify that the AI agent responds correctly:
My username is Alice. I want to find my current account balance.
I want to list my last five transactions for my account.
I want to update my registered email address to alex.new@abc.com
I want to know whether I have sufficient balance to transfer 5000 USD.
Step 3. Create an evaluation dataset and run the evaluation
Create an evaluation dataset by choosing one of the following methods:
The generate command method: Use this method when the AI agent’s expected behavior is already known or clearly defined by business rules.
The record command method: Use this method to capture live interactions for evaluation.
The generate command method
This method automatically creates ground truth evaluation datasets (test cases) from user stories and tool definitions. It helps you build realistic, repeatable test cases for evaluating and benchmarking AI agents.
Flow: Create user stories → Generate the evaluation dataset → Run the evaluation on the evaluation dataset.
Create or view user stories.
a. In the current working directory wxo-evaluation-framework, open the file user_stories/banking_user_stories.csv.
b. This CSV file contains the user stories to be tested. The file includes the following columns:
story: The user story.
agent: The AI agent responsible for running the user story.
You can edit this file and add more user stories as required.
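For illustration, a row in this CSV file might look like the following. The agent name shown here is a placeholder; use the exact name of your imported agent.

story,agent
"My username is Alice. I want to find my current account balance.",banking_agent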
Best practices for writing user stories:
Keep user stories realistic and context-based.
Clearly state the goal and include all required inputs for the tools.
Include both successful and fallback cases when applicable.
Cover all tools and tool parameters.
Include positive and negative scenarios.
Ensure that the AI agent name is correct and has no extra spaces.
Generate test cases from user stories.
This step runs each user story end-to-end, follows the defined tool flow, and creates the expected outputs. All generated test cases are saved in the folder user_stories/banking_agent_test_cases/.
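Test-case generation is driven by the ADK evaluations commands. The exact flags can differ between ADK releases, so treat the following invocation as an illustrative sketch rather than the definitive syntax:

orchestrate evaluations generate --stories-path ./user_stories/banking_user_stories.csv --output-dir ./user_stories/banking_agent_test_cases --env-file .env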
After the generation process completes, two test cases are created for each user story. The test cases are stored as JSON files in the user_stories/banking_agent_test_cases/ folder and are used in the next evaluation step.
Start the evaluation of all generated test cases.
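The evaluation is also run through the ADK evaluations commands. Again, flag names vary between ADK releases, so the following invocation is an illustrative sketch; check the help output of your ADK version for the exact syntax:

orchestrate evaluations evaluate --test-paths ./user_stories/banking_agent_test_cases --output-dir ./user_stories/test_execution --env-file .env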
During the evaluation, the evaluation framework simulates a real conversation between a user and the target AI agent. A user agent, which is powered by a large language model (LLM), reads the user story, interprets the context and intent, and interacts with the target AI agent as a user would. This process helps you to verify whether the target AI agent selects the correct tools and produces the expected response.
After the evaluation completes, you can view the final evaluation report.
The metrics table shows several values for each user story, including:
Total steps taken
LLM steps
Total tool calls
Tool call precision
Tool call recall
Agent routing accuracy
Text match
Journey success (0% or 100%)
Average response time
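Assuming the standard interpretation of these metrics, tool call precision is the fraction of tool calls that the agent made that also appear in the reference trajectory, and tool call recall is the fraction of tool calls in the reference trajectory that the agent actually made. Journey success is binary for each user story (0% or 100%), reflecting whether the simulated conversation met the success criteria defined by the user story.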
Use the Journey success value to identify which tools or AI agents require improvement. If a user story fails, review the associated tool or script and update it as needed.
You can view all evaluation metrics, simulated user interactions, and detailed analysis in the folder user_stories/test_execution/.
The record command method
This method captures live chat sessions from the chat interface and converts the sessions into structured datasets. Each chat session is recorded in real time, and the data is saved in a separate annotated file for later analysis. You can use these datasets to compare AI agent behavior across different configurations, such as model types, agent descriptions, or parameter settings.
Flow: Record the user interaction → Run the evaluation by using the recorded dataset.
Record the interaction.
a. Run the following command to start recording the interactions:
orchestrate evaluations record --output-dir ./chat_recordings --env-file .env
b. Recording is now active. All recorded interactions are saved in the chat_recordings folder.
c. Test the AI agent in the watsonx Orchestrate user interface by using the following sample questions. These interactions are saved in the chat_recordings folder after you stop the recording.
My username is Alice. I want to find my current account balance.
I want to list my last five transactions for my account.
I want to update my registered email address to alex.new@abc.com
I want to know whether I have sufficient balance to transfer 5000 USD.
Stop the recording.
a. When you complete your session, press Ctrl+C in the terminal where the record command is running. Make sure that the conversation is fully complete before stopping the recording so the dataset is not cut off.
b. You can view all chat recordings in the chat_recordings folder.
Note:
The annotated data is generated automatically, so review the data before using it for evaluation. Remove any details that do not apply to your tests.
The starting_sentence field is derived directly from your inputs, but fields such as story and goals are generated from the recorded conversation and might need validation.
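For orientation, each annotated recording is a JSON file whose shape resembles the following sketch. The field values are illustrative, the goals object is left empty because its structure depends on the recorded tool calls, and the exact schema depends on your ADK version.

{
  "agent": "banking_agent",
  "story": "The user wants to check the current balance of their account.",
  "goals": {},
  "starting_sentence": "My username is Alice. I want to find my current account balance."
}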
Generate the evaluation report by using the recorded chat dataset.
When the test cases appear, select Run to start the evaluation.
When the evaluation completes and the status shows Completed, select the completed entry to view the evaluation metrics.
View the evaluation metrics.
Answer quality metrics include Faithfulness, Relevance, and Correctness.
Tool quality metrics include Accuracy and Relevance.
Message completion metrics indicate Success or Failure.
For detailed information about these metrics, see Evaluating agents and tools in the watsonx Orchestrate documentation.
You can download the evaluation report by clicking the Download button.
Summary
This tutorial explained how to set up an evaluation framework to assess tools and agent behavior. The evaluation framework establishes a clear workflow and ensures consistent testing across the agent development lifecycle.
By implementing evaluation early in the agent development lifecycle, you create a data-driven feedback loop that improves agent performance and identifies opportunities to optimize tool behavior. Using the same evaluation steps and metrics helps standardize the development and testing process.
This approach provides actionable insights, supports consistent quality checks, and improves the overall robustness of AI agents before deployment.
Acknowledgments
This tutorial was produced as part of the IBM Open Innovation Community initiative: Agentic AI (AI for Developers and Ecosystem).