SQL evaluation framework for accurate query assessment
Explore a systematic approach to evaluating SQL query generation by comparing results against predefined gold queries while accounting for valid variations
SQL-Eval is a framework designed to evaluate the correctness of SQL queries generated by large language models (LLMs). Developed by the defog.ai team, it supports multiple LLM providers, including OpenAI, AWS Bedrock, Gemini, MLX, Mixtral, Anthropic, and more.
I contributed by developing a watsonx wrapper, integrating ChromaDB for few-shot learning, and implementing self-correcting SQL queries.
How SQL-Eval works
- Generates a SQL query with an LLM after retrieving relevant columns and metadata.
- Executes both the gold (expected) query and the LLM-generated query on their respective databases to obtain results.
- Compares the two dataframes using exact match and subset match to calculate the accuracy rate.
- Logs metrics (for example, token usage and latency) and aggregates results for reporting.
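The loop below is a minimal sketch of these steps, not the framework's actual code: `generate_sql`, `compare_df`, and `subset_df` are placeholder callables (the repository contains the real implementations), and both queries run over a single database connection for brevity.

```python
import time

import pandas as pd


def evaluate_one(question, gold_query, generate_sql, conn, compare_df, subset_df):
    """Sketch of a single evaluation step.

    generate_sql, compare_df, and subset_df are passed in as callables so the
    sketch stays independent of any particular LLM provider or comparison
    implementation; conn is a DB-API connection or SQLAlchemy engine.
    """
    start = time.time()
    generated_query = generate_sql(question)  # LLM call; token usage could be logged here too
    latency = time.time() - start

    # Execute the gold (expected) query and the generated query.
    df_gold = pd.read_sql_query(gold_query, conn)
    df_generated = pd.read_sql_query(generated_query, conn)

    # Exact match first; fall back to the relaxed subset comparison.
    exact = compare_df(df_generated, df_gold)
    correct = exact or subset_df(df_gold, df_generated)

    return {"correct": correct, "exact_match": exact, "latency_s": latency}
```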
Handling multiple correct queries
SQL-Eval ensures that different valid SQL queries are recognized as correct.
For example, the question: "Who are the 10 most recent users from Toronto?"
Both of these queries are valid:
Query 1
SELECT userid, username, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
Query 2
SELECT userid, firstname || ' ' || lastname, created_at
FROM users
WHERE city = 'Toronto'
ORDER BY created_at DESC
LIMIT 10;
SQL-Eval validates such variations, ensuring that all correct queries are accepted.
GitHub repository: SQL-generation-Evaluation
Principles of SQL-Eval
SQL-Eval follows key principles to ensure accurate and reliable evaluation:
- Reproducible and repeatable: Ensures consistent evaluations following the scientific method.
- Measures SQL complexity and diversity: Handles nested queries, multiple joins, and different ways of expressing SQL logic.
Creating the SQL-Eval dataset
The defog.ai team created a dataset of question and gold query pairs, where the gold query produces the expected results. Unlike other benchmarks that rely on string-based matching (e.g., ROUGE), SQL-Eval focuses on result accuracy rather than exact SQL syntax.
Since multiple SQL queries can correctly answer the same question, SQL-Eval accepts different valid formats.
Example: Top 3 restaurants in New York
All three queries below are correct:
- Query 1
SELECT name
FROM restaurants
GROUP BY name
ORDER BY AVG(rating) DESC LIMIT 3
- Query 2
SELECT id, name
FROM restaurants
GROUP BY name
ORDER BY AVG(rating) DESC LIMIT 3
- Query 3
SELECT name, AVG(rating)
FROM restaurants
GROUP BY 1
ORDER BY 2 DESC LIMIT 3
Handling multiple correct queries
To ensure flexibility, SQL-Eval allows multiple acceptable column variations. For example:
SELECT {id, user_name} FROM users;
This means any of the following queries are valid:
SELECT id FROM users;
SELECT user_name FROM users;
SELECT id, user_name FROM users;
Ensuring uniform SQL representation
SQL-Eval considers various SQL structures, including:
- Joins: Supports multiple joins and nested queries.
- Aggregations: Evaluates GROUP BY usage.
- Ordered results: Assesses ORDER BY statements.
- Ratios and calculations: Handles expressions like SELECT X / Y, ensuring meaningful pre-aggregation and avoiding division by zero.
- Filters: Tests the effectiveness of WHERE conditions.
Evaluation method
Big-picture approach
Validating SQL queries deterministically is challenging, but SQL-Eval takes this approach by comparing executed queries against predefined gold queries. Instead of relying on exact SQL matches, it evaluates logical equivalence, allowing for variations while handling edge cases.
To achieve this, each question/query pair defines a minimal set of acceptable query alternatives. The detailed logic behind this process is available in a Python function in our repository. The following example illustrates this process.
Example: User preferences query
Consider the following database schema:
CREATE TABLE users (
uid BIGINT,
name TEXT,
likes_movies BOOLEAN,
likes_plays BOOLEAN
)
For the question, "Return users and whether they like movies", an acceptable gold query is:
SELECT {uid, name}, likes_movies FROM users;
Since the question does not specify whether uid, name, or both should be included, multiple variations are considered correct. SQL-Eval ensures these flexible interpretations are accounted for.
Evaluation steps
We begin by expanding the braces in the gold query to generate all possible column combinations. This results in 2² − 1 = 3 valid queries (excluding the empty set):
SELECT uid, likes_movies FROM users
SELECT name, likes_movies FROM users
SELECT uid, name, likes_movies FROM users
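This brace expansion can be sketched in a few lines of Python. The helper below, `expand_braces`, is illustrative only and assumes a single `{...}` group per gold query; the framework's own expansion logic lives in the repository.

```python
import itertools
import re


def expand_braces(gold_query):
    """Expand one {col1, col2, ...} group into every non-empty column subset.

    Illustrative sketch: assumes at most one brace group per gold query.
    """
    match = re.search(r"\{(.*?)\}", gold_query)
    if not match:
        return [gold_query]

    columns = [col.strip() for col in match.group(1).split(",")]
    expanded = []
    for size in range(1, len(columns) + 1):
        for subset in itertools.combinations(columns, size):
            expanded.append(gold_query.replace(match.group(0), ", ".join(subset)))
    return expanded


# expand_braces("SELECT {uid, name}, likes_movies FROM users")
# returns the three query variants listed above.
```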
Each of these queries is then executed against the database, producing a corresponding dataframe. For brevity, we use two rows in this example:
dfs_gold = [
    pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]}),
    pd.DataFrame({"name": ["alice", "bob"], "likes_movies": [True, False]}),
    pd.DataFrame({"uid": [1, 2], "name": ["alice", "bob"], "likes_movies": [True, False]}),
]

After executing the gold queries and the generated query, compare their results. If the generated query's output matches any of the gold query results exactly (ignoring data types), it is marked as correct. This comparison is handled by the compare_df function.
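As a rough illustration of what such an exact comparison can look like, the sketch below ignores data types by casting values to strings and comparing them positionally. It is a simplified stand-in, not the repository's actual compare_df, which performs additional normalization.

```python
def compare_df_sketch(df_generated, df_gold):
    """Exact comparison of two result sets, ignoring data types.

    Simplified stand-in for the framework's compare_df.
    """
    if df_generated.shape != df_gold.shape:
        return False
    # Cast values to strings so dtype differences (e.g., int64 vs. object)
    # do not cause spurious mismatches, then compare values positionally.
    return bool((df_generated.astype(str).values == df_gold.astype(str).values).all())
```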
In the example below, only the first dataframe in dfs_gold matches exactly, as it contains the same columns (uid, likes_movies) with identical values.

Example:

# result of executing a generated query:
# `SELECT u.uid, u.likes_movies FROM users u`
df_generated = pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]})

compare_df(df_generated, dfs_gold[0])  # True
compare_df(df_generated, dfs_gold[1])  # False
compare_df(df_generated, dfs_gold[2])  # False

However, some variations in query results should still be considered correct, such as:
Column aliases: Queries that rename columns but return the same data.

SELECT u.uid AS id, likes_movies FROM users u;

id  likes_movies
1   True
2   False

Additional columns: Queries that include extra, non-essential columns.

id  name   likes_movies
1   alice  True
2   bob    False

Different row order: Queries that return the correct data but in a different order.

SELECT uid, likes_movies FROM users u ORDER BY likes_movies;

uid  likes_movies
2    False
1    True
These variations are harmless and may even be useful depending on the context. To account for them, SQL-Eval uses a relaxed comparison function called subset_df. If a generated query does not match exactly using the compare_df function, the subset_df function checks whether the gold query result is a subset of the generated query result.
Subset evaluation
Here’s how the subset_df function works:
- For each column in df1, check if the same values exist in df2, ignoring data types, column names (which may be aliased), and row order.
- Extract the relevant columns from df2, rename them to match df1, and verify that the resulting dataframe matches df1. This prevents incorrect matches caused by shuffled columns, which can occur with low-cardinality data types like booleans or enums.
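One simplified way to implement those two steps is sketched below. The function `subset_df_sketch` is illustrative only; the subset_df in the repository covers additional edge cases.

```python
def subset_df_sketch(df_gold, df_generated):
    """Check whether df_gold is contained in df_generated.

    Simplified stand-in for the framework's subset_df: column names, data
    types, and row order in df_generated may differ, and extra columns are
    allowed.
    """
    def values(series):
        # Compare column contents as sorted strings to ignore dtypes and row order.
        return sorted(series.astype(str).tolist())

    # Step 1: for each gold column, find a generated column with the same values.
    mapping = {}
    for gold_col in df_gold.columns:
        for gen_col in df_generated.columns:
            if gen_col not in mapping.values() and values(df_gold[gold_col]) == values(df_generated[gen_col]):
                mapping[gold_col] = gen_col
                break
        else:
            return False  # no generated column carries this gold column's values

    # Step 2: extract the matched columns, rename them to the gold names, and
    # verify the rows line up (this catches shuffled low-cardinality columns).
    projected = df_generated[[mapping[col] for col in df_gold.columns]].copy()
    projected.columns = df_gold.columns

    def normalize(df):
        out = df.astype(str)
        return out.sort_values(by=list(out.columns)).reset_index(drop=True)

    return normalize(projected).equals(normalize(df_gold))
```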
Conclusion
This evaluation method accounts for harmless result variations, such as column renaming, additional columns, and row reordering.
You can test your own prompt-engineered solutions using the watsonx API (as outlined in the README) since fine-tuning a model may not always be feasible.
If you're considering a text-to-SQL model or product, assess whether this benchmark aligns with your needs—particularly in terms of semantic diversity, SQL complexity, and specificity within a narrow domain.
Explore our repository here: SQL generation Evaluation.