SQL evaluation framework for accurate query assessment
Explore a systematic approach to evaluating SQL query generation by comparing results against predefined gold queries while accounting for valid variations
SQL-Eval is a framework designed to evaluate the correctness of SQL queries generated by Large Language Models (LLMs). Developed by the defog.ai team, it supports multiple LLMs, including OpenAI, AWS Bedrock, Gemini, MLX, Mixtral, Anthropic, and more.
I contributed by developing a watsonx wrapper, integrating ChromaDB for few-shot learning, and implementing self-correcting SQL queries.
How SQL-Eval works
Generates a SQL query from an LLM after retrieving relevant columns and metadata.
Executes both the gold (expected) query and the LLM-generated query on their respective databases to obtain results.
Compares the two dataframes using exact match and subset match to calculate the accuracy rate.
Logs metrics (for example, token usage, latency) and aggregates results for reporting.
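As a rough illustration, one evaluation iteration might be sketched as follows. The function name and signature are illustrative, the loop is simplified to a single gold query (the real framework also handles multiple acceptable gold variants), and compare_df and subset_df stand for the comparison helpers discussed later in this article:
import time
import pandas as pd

def evaluate_question(question, gold_query, conn, generate_sql):
    """One simplified evaluation iteration: generate SQL, run both queries, compare results."""
    start = time.time()
    generated_query = generate_sql(question)  # LLM call; the prompt includes schema and metadata
    latency = time.time() - start

    df_gold = pd.read_sql(gold_query, conn)            # expected result set
    df_generated = pd.read_sql(generated_query, conn)  # model's result set

    exact = compare_df(df_generated, df_gold)             # strict comparison
    correct = exact or subset_df(df_gold, df_generated)   # relaxed subset comparison
    return {"exact_match": exact, "correct": correct, "latency_seconds": latency}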
Handling multiple correct queries
SQL-Eval ensures that different valid SQL queries are recognized as correct.
For example, the question:
"Who are the 10 most recent users from Toronto?"
Both of these queries are valid:
Query 1
SELECT userid, username, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
Query 2
SELECT userid, firstname || ' ' || lastname, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
SQL-Eval validates such variations, ensuring that all correct queries are accepted.
SQL-Eval follows key principles to ensure accurate and reliable evaluation:
Reproducible and repeatable: Ensures consistent evaluations following the scientific method.
Measures SQL complexity and diversity: Handles nested queries, multiple joins, and different ways of expressing SQL logic.
Creating the SQL-Eval dataset
The defog.ai team created a dataset of questions and gold query pairs, where the gold query produces the expected results. Unlike other benchmarks that rely on string-based matching (e.g., ROUGE), SQL-Eval focuses on result accuracy rather than exact SQL syntax.
Since multiple SQL queries can correctly answer the same question, SQL-Eval accepts different valid formats.
Example: top 3 restaurants in New York
All of the queries below are correct:
Query 1
SELECT name
FROM restaurants
GROUP BY name
ORDER BY AVG(rating) DESC
LIMIT 3
Query 2
SELECT id, name
FROM restaurants
GROUP BY name
ORDER BY AVG(rating) DESC
LIMIT 3
Query 3
SELECT name, AVG(rating)
FROM restaurants
GROUP BY 1
ORDER BY 2 DESC
LIMIT 3
Handling multiple correct queries
To ensure flexibility, SQL-Eval allows multiple acceptable column variations. For example:
SELECT {id, user_name} FROM users;
This means any of the following queries are valid:
SELECT id FROM users;
SELECT user_name FROM users;
SELECT id, user_name FROM users;
Ensuring uniform SQL representation
SQL-Eval considers various SQL structures, including:
Joins: Supports multiple joins and nested queries.
Aggregations: Evaluates GROUP BY usage.
Ordered results: Assesses ORDER BY statements.
Ratios and calculations: Handles expressions like SELECT X / Y, ensuring meaningful pre-aggregation and avoiding division by zero.
Filters: Tests the effectiveness of WHERE conditions.
Evaluation method
Big-picture approach
Validating SQL queries deterministically is challenging, but SQL-Eval takes this approach by comparing the results of executed queries against the results of predefined gold queries. Instead of relying on exact SQL matches, it evaluates logical equivalence, allowing for variations while handling edge cases.
To achieve this, define a minimal set of acceptable query alternatives in the question/query pairs. The detailed logic behind this process is available in a Python function in our repository. The following example illustrates this process.
For the question, "Return users and whether they like movies", an acceptable gold query is:
SELECT {uid, name}, likes_movies FROM users;
Since the question does not specify whether uid, name, or both should be included, multiple variations are considered correct. SQL-Eval ensures these flexible interpretations are accounted for.
Evaluation steps
We begin by expanding the braces in the gold query to generate all possible column combinations. This results in 2² − 1 = 3 valid queries (excluding the empty set):
SELECT uid, likes_movies FROM users
SELECT name, likes_movies FROM users
SELECT uid, name, likes_movies FROM users
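A small sketch of how this brace expansion could be implemented; the helper name and single-group handling are illustrative rather than the repository's exact logic:
import re
from itertools import combinations

def expand_braces(query: str) -> list[str]:
    """Expand a {col1, col2, ...} group into every non-empty column combination."""
    match = re.search(r"\{(.+?)\}", query)
    if not match:
        return [query]
    columns = [c.strip() for c in match.group(1).split(",")]
    expanded = []
    for r in range(1, len(columns) + 1):
        for combo in combinations(columns, r):
            expanded.append(query.replace(match.group(0), ", ".join(combo)))
    return expanded

# expand_braces("SELECT {uid, name}, likes_movies FROM users") returns the three queries above.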
Each of these queries is then executed against the database, producing a corresponding dataframe. For brevity, we use a two-row users table in this example.
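A pandas construction of these gold dataframes might look like the following; the two-row contents (users alice and bob) are illustrative and match the values used in the rest of this example:
import pandas as pd

# Hypothetical two-row result sets for the three expanded gold queries.
dfs_gold = [
    # SELECT uid, likes_movies FROM users
    pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]}),
    # SELECT name, likes_movies FROM users
    pd.DataFrame({"name": ["alice", "bob"], "likes_movies": [True, False]}),
    # SELECT uid, name, likes_movies FROM users
    pd.DataFrame({"uid": [1, 2], "name": ["alice", "bob"],
                  "likes_movies": [True, False]}),
]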
After executing the gold queries and the generated query, compare their results. If the generated query's output matches any of the gold query results exactly (ignoring data types), it is marked as correct. This comparison is handled by the compare_df function.
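A minimal sketch of such an exact comparison (the repository's compare_df is more involved; this simplified version only ignores data types and column order):
import pandas as pd

def compare_df(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
    """Exact comparison: same columns and same values, ignoring data types and column order."""
    if sorted(df1.columns) != sorted(df2.columns):
        return False
    a = df1[sorted(df1.columns)].astype(str).reset_index(drop=True)
    b = df2[sorted(df2.columns)].astype(str).reset_index(drop=True)
    return a.equals(b)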
In the example below, only the first dataframe in dfs_gold matches exactly, as it contains the same columns (uid, likes_movies) with identical values.
Example:
# Result of executing a generated query:
# `SELECT u.uid, u.likes_movies FROM users u`
df_generated = pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]})

compare_df(df_generated, dfs_gold[0])  # True  (same columns, uid and likes_movies, with identical values)
compare_df(df_generated, dfs_gold[1])  # False (name column instead of uid)
compare_df(df_generated, dfs_gold[2])  # False (extra name column)
However, some variations in the generated query's results should still be considered correct, such as:
Column aliases: Queries that rename columns but return the same data.
SELECT u.uid AS id, likes_movies FROM users u;
id  likes_movies
1   True
2   False
Additional columns: Queries that include extra, non-essential columns.
id  name   likes_movies
1   alice  True
2   bob    False
Different row order: Queries that return the correct data but in a different order.
SELECT uid, likes_movies FROM users u ORDER BY likes_movies;
uid  likes_movies
2    False
1    True
These variations are harmless and may even be useful depending on the context. To account for them, use a relaxed comparison function called subset_df. If a generated query does not match exactly using the compare_df function, the subset_df function checks whether the gold query result is a subset of the generated query result.
For each column in df1, check if the same values exist in df2, ignoring data types, column names (which may be aliased), and row order.
Extract the relevant columns from df2, rename them to match df1, and verify that the resulting dataframe matches df1. This prevents incorrect matches caused by shuffled columns, which can occur with low-cardinality data types like booleans or enums.
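A rough sketch of what subset_df could look like, where df1 is the gold result and df2 the generated result; this simplified version pairs columns greedily by their values and is not the repository's exact implementation:
import pandas as pd

def subset_df(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
    """Return True if df1 (gold result) is contained in df2 (generated result),
    ignoring data types, column names (aliases), and row order."""
    if len(df1) != len(df2):
        return False
    # Pair each gold column with an unused generated column holding the same values.
    matched, used = {}, set()
    for gold_col in df1.columns:
        gold_values = sorted(df1[gold_col].astype(str))
        for gen_col in df2.columns:
            if gen_col not in used and sorted(df2[gen_col].astype(str)) == gold_values:
                matched[gold_col] = gen_col
                used.add(gen_col)
                break
        else:
            return False
    # Extract the matched columns, rename them to the gold names, and compare rows
    # with row order ignored; this guards against shuffled low-cardinality columns.
    candidate = df2[[matched[c] for c in df1.columns]].astype(str)
    candidate.columns = list(df1.columns)
    gold = df1.astype(str)
    sort_cols = list(df1.columns)
    candidate = candidate.sort_values(sort_cols).reset_index(drop=True)
    gold = gold.sort_values(sort_cols).reset_index(drop=True)
    return candidate.equals(gold)

With this sketch, each of the three result variations shown above would be accepted against dfs_gold[0].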
Conclusion
This evaluation method accounts for harmless result variations, such as column renaming, additional columns, and row reordering.
You can test your own prompt-engineered solutions using the watsonx API (as outlined in the README) since fine-tuning a model may not always be feasible.
If you're considering a text-to-SQL model or product, assess whether this benchmark aligns with your needs—particularly in terms of semantic diversity, SQL complexity, and specificity within a narrow domain.