
Generating SQL from text with LLMs

Convert text into a structured representation and generate a semantically correct SQL query

By Joshua Noble

In this tutorial, you learn how to use a large language model (LLM) from the IBM Granite model family to create valid structured query language (SQL) statements from natural language descriptions of data operations. This kind of text-to-SQL task uses natural language processing (NLP) to generate SQL queries from natural language text. This is not a new concept. In fact, there's an entire dataset and competition devoted to natural language to SQL generation called Spider.

This task involves converting text input into a structured representation and using that structured data to generate a semantically correct SQL query that can be executed on a database. However, there are some important differences between human language, which can be ambiguous and context-dependent, and SQL, which is comparatively very precise and structured. So we have two parts to our task:

  1. First, we need to understand what kinds of questions a user might ask and determine which of our SQL tables should be used to answer them.
  2. Once we've done that, the recognized user intent can be used to generate SQL queries that run on those tables. One way to do this is to fine-tune our LLM so that it learns the names of our tables and columns, as well as the specific tasks that apply to them.

A different option is to pass some of the required information in the prompt itself, a technique often called prompt engineering or in-context learning. We'll use the large context window of the 20 billion parameter Granite model to pass contextual information about our data to the model before it generates a response.

Prerequisites

You need an IBM Cloud account to create a watsonx.ai project.

Steps

Step 1: Set up your environment

In this step, we'll guide you through setting up a watsonx.ai project and a Jupyter Notebook using your IBM Cloud account.

  1. Log in to watsonx.ai using your IBM Cloud account.

  2. Create a watsonx.ai project.

  3. Create a Jupyter Notebook.

    Choose "Runtime 23.1 on Python (4 vCPU 16 GB RAM)" to define the configuration.

Step 2. Set up the watsonx.ai Runtime service and API key

  1. Create a watsonx.ai Runtime service instance (choose the Lite plan, which is a free instance).
  2. Generate an API Key.
  3. Associate the watsonx.ai Runtime service to the project that you created in watsonx.ai.

Step 3. Install and import relevant libraries and set up your credentials

First, we need to install the libraries needed to run our notebook. We use the pip command prefixed with a ! so that the notebook runs it as a system command rather than as Python code.

!pip install wget | tail -n 1
!pip install datasets | tail -n 1
!pip install scikit-learn | tail -n 1
!pip install ibm-watsonx-ai==1.0.11 | tail -n 1
!pip install "ibm-watson-machine-learning>=1.0.327"

Next, we need to set up the credentials. We'll import the os and getpass libraries in order to do that:

import os
import getpass

Set up your credentials and input your API Key:

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Please enter your WML api key (hit enter): ")
}

Set up your project_id as part of your environment variables or input it:

try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

Step 4. Import our model into Watson Studio

Now we're ready to import our model into Watson Studio.

from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes
print([model.name for model in ModelTypes])

This should return a list of all of the pre-trained models that we can load:

['FLAN_T5_XXL', 'FLAN_UL2', 'MT0_XXL', 'GPT_NEOX', 'MPT_7B_INSTRUCT2', 'STARCODER', 'LLAMA_2_70B_CHAT', 'LLAMA_2_13B_CHAT', 'GRANITE_13B_INSTRUCT', 'GRANITE_13B_CHAT', 'FLAN_T5_XL', 'GRANITE_13B_CHAT_V2', 'GRANITE_13B_INSTRUCT_V2', 'ELYZA_JAPANESE_LLAMA_2_7B_INSTRUCT', 'MIXTRAL_8X7B_INSTRUCT_V01_Q', 'CODELLAMA_34B_INSTRUCT_HF', 'GRANITE_20B_MULTILINGUAL', 'MERLINITE_7B', 'GRANITE_20B_CODE_INSTRUCT', 'GRANITE_34B_CODE_INSTRUCT', 'GRANITE_3B_CODE_INSTRUCT', 'GRANITE_7B_LAB', 'GRANITE_8B_CODE_INSTRUCT', 'LLAMA_3_70B_INSTRUCT', 'LLAMA_3_8B_INSTRUCT', 'MIXTRAL_8X7B_INSTRUCT_V01']

In our case, we want to use the 20B parameter Granite Code Base model. These models were trained on code as well as questions and answers about code, so they have exposure both to syntactically correct examples of code and to contextual information about that code. This allows us to ask questions about specific functionality and have the model generate meaningful code samples.
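
If you want to narrow that list down to just the code-tuned Granite variants, you can filter the enum; a minimal sketch (the exact names vary with your ibm-watsonx-ai version):

# ModelTypes was imported above; filter the enum for Granite models tuned on code
granite_code_models = [m.name for m in ModelTypes if "GRANITE" in m.name and "CODE" in m.name]
print(granite_code_models)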

The Granite foundation models are automatically available in watsonx.ai and accessible through the ibm_watsonx_ai library. They are also available as open source models on GitHub and Hugging Face.

The Granite Code models can generate code from prompts in 116 programming languages, including SQL.

We'll store the model name in a variable:

model_id = ModelTypes.GRANITE_20B_CODE_BASE

Now we set the parameters for our model.

  • MIN_NEW_TOKENS and MAX_NEW_TOKENS. These parameters set a range on the length of the generated output, which keeps our answers from being too short or too long.
  • DECODING_METHOD. With this parameter, we select how our model picks the next token. Greedy search is the simplest decoding method: it always selects the token with the highest probability. The other option is sampling, where the model chooses a subset of candidate tokens and then picks one at random from that subset to add to the output text. Sampling adds variability and randomness to the decoding process, which can be helpful in creative use cases, but that greater variability also makes incorrect or nonsensical output more likely, which is not great when we want to output correct code.
  • TEMPERATURE. This parameter is one of the most fundamental and widely used sampling controls for LLMs. It controls the randomness of the model's output by scaling the logits (unnormalized log probabilities) before the softmax function is applied. Higher values lead to greater variability, lower values to less. We'll use a lower temperature here to keep our model from being too creative.
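
To make the temperature mechanics concrete, here is a small, self-contained sketch showing how dividing the logits by the temperature before the softmax sharpens or flattens the distribution over candidate tokens. The logit values are made up purely for illustration:

import numpy as np

def softmax(logits, temperature=1.0):
    # Divide the logits by the temperature, then normalize into probabilities
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

print(softmax(logits, temperature=1.0))  # ~[0.63 0.23 0.14]
print(softmax(logits, temperature=0.3))  # ~[0.96 0.03 0.01], much more peaked

Note that with greedy decoding the temperature has little practical effect, since greedy always takes the single highest-probability token; it matters as soon as you switch to sampling. With those parameters in mind, we can define them:
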
from ibm_watsonx_ai.metanames import GentextParamsMetaNames as GenParams

from ibm_watsonx_ai.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.MIN_NEW_TOKENS: 10,
    GenParams.MAX_NEW_TOKENS: 196,
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.TEMPERATURE: 0.3,
    GenParams.REPETITION_PENALTY: 1
}

Now, we instantiate the model:

from ibm_watsonx_ai.foundation_models import Model

model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id)

Step 5. Testing some prompts

To start, let's have our model create a SQL database and then create a table called Employees. The generate_text() method handles tokenizing the prompt, passing it to the model, and decoding the generated tokens back into text. All we need to do is pass the prompt:

create_data_base = """Write a SQL statement to create a database called CompanyDB and a table called Employees 
                      with first_name, last_name, employee_id, email"""

print(model.generate_text(prompt=create_data_base))

This prompt returns:

CREATE DATABASE CompanyDB;

USE CompanyDB;

CREATE TABLE Employees (
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    employee_id INT,
    email VARCHAR(50),
    password VARCHAR(50)
);

This is valid SQL syntax and, if passed to a SQL database, would create the requested column names and column types (note that the model also added a password column that we didn't ask for).
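
If you want to sanity-check output like this, one lightweight option is to run it against an in-memory SQLite database using Python's built-in sqlite3 module. This is only a sketch, assuming the model returns output shaped like the example above; SQLite doesn't support CREATE DATABASE or USE, so we skip those statements:

import sqlite3

generated_sql = model.generate_text(prompt=create_data_base)

# Split into individual statements; SQLite has no CREATE DATABASE / USE,
# so keep only the statements it understands
statements = [s.strip() for s in generated_sql.split(";") if s.strip()]
runnable = [s for s in statements if not s.upper().startswith(("CREATE DATABASE", "USE"))]

conn = sqlite3.connect(":memory:")
for statement in runnable:
    conn.execute(statement)

# Confirm the Employees table and its columns were created
print(conn.execute("PRAGMA table_info(Employees)").fetchall())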

Without any changes in how we call the model, we can even insert data:

csv_input = """first_name,last_name,employee_id,email
                Juan,Valdez,19203,jvaldez@company.com
                Max,Mustermann,18131,mmustermann@company.com
                Hanako,Yamada,95824,hyamada@company.com
                Ivan,Ivanov,40194,iivanov@company.com"""

print(model.generate_text(prompt='Write an SQL insert statement for this csv: ' + csv_input))

This will return accurate SQL which could be used to insert those rows into the Employees table:

INSERT INTO Employees (first_name, last_name, employee_id, email)

VALUES ('Juan', 'Valdez', 19203, 'jvaldez@company.com'),
       ('Max', 'Mustermann', 18131,'mmustermann@company.com'),
       ('Hanako', 'Yamada', 95824, 'hyamada@company.com'),
       ('Ivan', 'Ivanov', 40194, 'iivanov@company.com');

The SQL is correct. We can use the same approach with JSON data or other plain-text data formats as well, as shown in the sketch below. When we want to do more complex queries, though, we'll want to make better use of the context window of the Granite Code model.
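
For example, here's a quick sketch passing a couple of hypothetical records serialized with Python's json module (the records are invented for illustration):

import json

# Hypothetical employee records, serialized to a JSON string for the prompt
records = [
    {"first_name": "Juan", "last_name": "Valdez", "employee_id": 19203, "email": "jvaldez@company.com"},
    {"first_name": "Max", "last_name": "Mustermann", "employee_id": 18131, "email": "mmustermann@company.com"},
]

json_input = json.dumps(records, indent=2)
print(model.generate_text(prompt="Write an SQL insert statement for this JSON: " + json_input))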

Step 6. Expanding the context window

A context window refers to the amount of text data that a language model can consider at one time when generating responses. It includes all the tokens (words or pieces of words) from the input text that the model looks at in order to create context before it generates a reply. The size of this window directly influences the model’s response accuracy and contextual appropriateness.

The context window's size varies across different language models, which impacts how they process and interpret text. A smaller window makes the model more efficient but limits the context the model may need to generate an accurate or meaningful response. A much larger window can lead to the model processing unnecessary amounts of information, not all of which is relevant.

Using the context window to pass information instead of fine-tuning saves compute time, since fine-tuning is computationally expensive, and it also saves the effort of creating training data.

Granite Code 20B has a context window of 8,192 tokens, which means that we can use up to 8,192 tokens of input to set the context. In our case, we can pass information about the database schema to get more accurate answers from the model, even when we ask natural language questions that require the model to infer information.
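
Rather than hand-writing the schema description each time, you can render it from a small dictionary. Here's a minimal sketch (build_prompt is our own illustrative helper, not part of the library); it produces essentially the same prompt that we write out by hand next:

schema = {
    "Employees": ["first_name", "last_name", "employee_id", "email"],
    "Sales": ["sale_id", "customer_id", "employee_id", "sale_date"],
}

def build_prompt(schema, question):
    # Describe each table's columns, then append the natural language question
    context = " ".join(
        f"The {table} table contains {', '.join(columns)}." for table, columns in schema.items()
    )
    return f"Output only SQL statements. {context} {question}"

print(build_prompt(schema, "Which employee had the most sales in October 2023?"))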

sales_prompt = "Output only SQL statements. The Employees table contains data relating to users, people, and employees, the Sales table contains data relating to sales. The Employees table contains first_name, last_name, employee_id, email. The Sales table contains sale_id, customer_id, employee_id, sale_date. Which employee had the most sales in October 2023?"

print(model.generate_text(prompt=sales_prompt))

Because of the context that we passed, we get a correctly generated SQL query for this complex request:

SELECT e.employee_id, e.first_name, e.last_name, COUNT(s.sale_id) AS num_sales FROM Employees e JOIN Sales s ON e.employee_id = s.employee_id WHERE s.sale_date BETWEEN '2023-10-01' AND '2023-10-31' GROUP BY e.employee_id, e.first_name, e.last_name ORDER BY num_sales DESC LIMIT 1;

This SQL query correctly interprets the request above and outputs valid SQL. If we didn't pass this context, however, we would get hallucinated results:

message = "Output only SQL statements. Which employee had the most sales in October 2023?"
print(model.generate_text(prompt=message))

This would return:

SELECT employee_id, SUM(amount) AS total_sales
FROM sales
WHERE MONTH(sale_date) = 10 AND YEAR(sale_date) = 2023
GROUP BY employee_id
ORDER BY total_sales DESC
LIMIT 1;

We can see that the model has returned reasonable-looking SQL, but it doesn't correspond to the database that we've configured. The model has invented table names and columns that don't exist.

To remedy this, we can simply pass even more information in through our context window:

message = """Output only SQL statements.
The Employees table contains first_name, last_name, employee_id, email. 
The Sales table contains sale_id, customer_id, employee_id, sale_date. 
The Customer table contains customer_id, contact_email, contact_first_name, contact_last_name. 
Which employee email and customer email were associated with the most sales in October 2023?"""

print(model.generate_text(prompt=message))

This returns the correct query using the right aggregation:

Here is the SQL query that would solve this problem:

```sql
SELECT 
  e.email AS employee_email, 
  c.contact_email AS customer_email, 
  COUNT(s.sale_id) AS total_sales 
FROM 
  Employees e 
  JOIN Sales s ON e.employee_id = s.employee_id 
  JOIN Customer c ON s.customer_id = c.customer_id 
WHERE 
  MONTH(s.sale_date) = 10 
  AND YEAR(s.sale_date) = 2023 
GROUP BY 
  e.email, 
  c.contact_email 
ORDER BY 
  total_sales DESC 
LIMIT 1;
```

An alternative to simply using the context window would be to store the database table schema and information about the database in a vector store and retrieve them at query time using retrieval augmented generation (RAG). One IBM paper even shows techniques for using a graph database to infer information about databases from natural language.
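
As a rough illustration of that retrieval idea, here is a minimal sketch using scikit-learn (which we installed earlier) to pick the most relevant table descriptions for a question via TF-IDF similarity. A production system would use learned embeddings and a vector store instead:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Short descriptions of each table, acting as our retrievable "documents"
table_docs = [
    "Employees table: first_name, last_name, employee_id, email",
    "Sales table: sale_id, customer_id, employee_id, sale_date",
    "Customer table: customer_id, contact_email, contact_first_name, contact_last_name",
]

question = "Which employee had the most sales in October 2023?"

# Vectorize the descriptions plus the question, then score each description
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(table_docs + [question])
scores = cosine_similarity(doc_vectors[-1], doc_vectors[:-1]).flatten()

# Keep the two most relevant schema snippets and prepend them to the prompt
top_docs = [table_docs[i] for i in scores.argsort()[::-1][:2]]
prompt = "Output only SQL statements. " + " ".join(top_docs) + " " + question
print(model.generate_text(prompt=prompt))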

Summary and next steps

There are many real-world uses for generating code with artificial intelligence and database access is one of the most common. You might imagine integrations directly with a SQL agent that turns natural language queries in English into usable SQL statements. While the SQL generated by an LLM like Granite Code or OpenAI's GPT-4 might need some optimization before it could be used at scale, for a simple lookup or small business inquiries like the ones shown in this tutorial, the generated code is perfectly sufficient.

In this tutorial, you learned how Granite Code models can be used to output valid SQL for a database. We loaded the 20 billion parameter version of Granite Code Base into a Jupyter Notebook and prompted it with complex prompts for SQL queries. Using the expanded context window allowed us to pass the names of specific tables and columns, and even associations between SQL entities and tasks, to the model. We can get accurate versions of even moderately complex queries, like multiple joins, from our model, giving us the possibility of building tools that allow complex data queries from natural language.

Try watsonx for free

Build an AI strategy for your business on one collaborative AI and data platform called IBM watsonx, which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data. These models are accessible to all as many no-code and low-code options are available for beginners.

Try watsonx.ai, the next-generation studio for AI builders.

Next steps

Explore more articles and tutorials about watsonx on IBM Developer.