Nate Rosidi
2025-08-12 10:00:00
www.kdnuggets.com


Image by Author | Canva
With large lagnuage models (LLMs), everyone is a coder today! This is a message you get from the LLM promo materials. It’s obviously not true, just like any ad. Coding is much more than producing code at breakneck speed. However, translating English (or other natural languages) into executable SQL queries is one of the most compelling uses of LLMs, and it has its place in the world.
# Why Use LLMs to Generate SQL?
There are several benefits of using LLMs to generate SQL, and, as with everything, there are also some cons.
# Two Types of Text-to-SQL LLMs
We can distinguish between two very broad types of text-to-SQL technology currently available regarding their access to your database schema.
- LLMs without direct access
- LLMs with direct access
// 1. LLMs Without Direct Access to Database Schema
These LLMs don’t connect to or execute queries against the actual database. The closest you can get is to upload the datasets you want to query. These tools rely on you providing context about your schema.
Tool Examples:
Use Cases:
- Query drafting and prototyping
- Learning and teaching
- Static code generation for later review
// 2. LLMs With Direct Access to Database Schema
These LLMs connect directly to your live data sources, such as PostgreSQL, Snowflake, BigQuery, or Redshift. They allow you to generate, execute, and return results from SQL queries live on your database.
Tool Examples:
Use Cases:
- Conversational analytics for business users
- Real-time data exploration
- Embedded AI assistants in BI platforms
# Step-by-Step: How to Go from Text to SQL
The basic workflow of getting SQL from text is similar, whether you use disconnected or connected LLMs.
We’ll try to solve an interview question from Shopify and Amazon using the steps above in ChatGPT.
// 1. Define the Schema
For the query to work on your data, the LLM needs to understand your data structure clearly. This typically encompasses:
- Table names
- Column names and types
- Relationships between tables (joins, keys)
This information can be passed directly in the prompt or can be retrieved dynamically using vector search within the retrieval-augmented generation (RAG) pipeline.
// 2. Prompt With Natural Language
The prompt will typically consist of two segments:
- Schema definition
- Question(s) for which we need an SQL answer
Example: Let me first provide you with a prompt structure that includes placeholders. We’ll then write an actual prompt.
We will use role-play prompting, which means instructing ChatGPT to assume a specific role.
Here’s how to structure the prompt.
Dataset: My dataset consists of [number of tables] tables.
The first one is [table “” not found /]
with the following columns and data types:
[column names and data types]
The second table is [table “” not found /]
with the following columns and data types:
[column names and data types]
Question: [provide a question to be answered]
Assumptions: [provide assumptions for solving the question]
Role: [describe a role LLM has to play]
Let’s now write an actual prompt that solves our interview questions.
Here it is, and we hope it will provide a correct SQL solution. The column names, data types, question, and assumptions are just pasted from the question itself.
Dataset: My dataset consists of two tables.
The first one is "customers" with the following columns and data types:
address:
text
city:
text
first_name:
text
id:
bigint
last_name:
text
phone_number:
text
The second table is "orders" with the following columns and data types:
cust_id:
bigint
id:
bigint
order_date:
date
order_details:
text
total_order_cost:
bigint
Question: Find the customers with the highest daily total order cost between 2019-02-01 and 2019-05-01. If a customer had more than one order on a certain day, sum the order costs on a daily basis. Output each customer's first name, total cost of their items, and the date.
Assumption: For simplicity, you can assume that every first name in the dataset is unique.
Role: Act as a SQL expert and write a PostgreSQL query that answers the question.
// 3. Generate the SQL Query
In this stage, the LLM of your choice generates the SQL query.
Example: Here’s the PostgreSQL code ChatGPT produced for us.
SELECT
c.first_name,
daily_totals.order_date,
daily_totals.total_cost
FROM (
SELECT
cust_id,
order_date,
SUM(total_order_cost) AS total_cost
FROM orders
WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY cust_id, order_date
) AS daily_totals
JOIN customers c ON c.id = daily_totals.cust_id
WHERE (daily_totals.cust_id, daily_totals.order_date) IN (
SELECT
cust_id,
order_date
FROM (
SELECT
cust_id,
order_date,
SUM(total_order_cost) AS total_cost,
RANK() OVER (ORDER BY SUM(total_order_cost) DESC) AS rnk
FROM orders
WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY cust_id, order_date
) ranked
WHERE rnk = 1
);
// 4. Execute the Query
Now, you can execute the query, be it directly (if your LLM supports live connections) or by copying and running it within your database system.
Example: In our case, we will copy the code into the StrataScratch code editor and use the ‘Check Solution’ button to validate the solution.
This is what it outputs, and it’s a correct solution. Bravo for ChatGPT! It nailed it on the first try!
first_name | order_date | total_cost |
---|---|---|
Jill | 2019-04-19 | 275 |
Mark | 2019-04-19 | 275 |
// 5. Review, Visualize, and Refine
Depending on the purpose of using LLMs to write SQL code, this step may be optional. In the business world, you’d typically present the query output in a user-friendly format, which typically involves:
- Showing results as a table and/or chart
- Allowing follow-up requirements (e.g., “Can you include the customer city?”) and providing the changed query and output
# Pitfalls and Best Practices
In our example, ChatGPT immediately came up with the correct answer. However, it doesn’t mean it always does, especially when data and requirements get more complicated. Using LLMs to get SQL queries from text is not without pitfalls. You can avoid them by applying some best practices if you want to make LLM query generation a part of your data science workflow.
# Conclusion
LLMs can be your best friend when you want to create SQL queries from text. However, to make the best of these tools, you must have a clear understanding of what you want to achieve and the use cases where using LLMs is beneficial.
This article provides you with such guidelines, along with an example of how to prompt an LLM in natural language and get a working SQL code.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
Transform your cleaning routine with the Shark AI Ultra Voice Control Robot Vacuum! This high-tech marvel boasts over 32,487 ratings, an impressive 4.2 out of 5 stars, and has been purchased over 900 times in the past month. Perfect for keeping your home spotless with minimal effort, this vacuum is now available for the unbeatable price of $349.99!
Don’t miss out on this limited-time offer. Order now and let Shark AI do the work for you!
Help Power Techcratic’s Future – Scan To Support
If Techcratic’s content and insights have helped you, consider giving back by supporting the platform with crypto. Every contribution makes a difference, whether it’s for high-quality content, server maintenance, or future updates. Techcratic is constantly evolving, and your support helps drive that progress.
As a solo operator who wears all the hats, creating content, managing the tech, and running the site, your support allows me to stay focused on delivering valuable resources. Your support keeps everything running smoothly and enables me to continue creating the content you love. I’m deeply grateful for your support, it truly means the world to me! Thank you!
BITCOIN bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge Scan the QR code with your crypto wallet app |
DOGECOIN D64GwvvYQxFXYyan3oQCrmWfidf6T3JpBA Scan the QR code with your crypto wallet app |
ETHEREUM 0xe9BC980DF3d985730dA827996B43E4A62CCBAA7a Scan the QR code with your crypto wallet app |
Please read the Privacy and Security Disclaimer on how Techcratic handles your support.
Disclaimer: As an Amazon Associate, Techcratic may earn from qualifying purchases.