[Text-to-SQL] Learning to query tables with natural language

How to model a natural language interface for relational databases

The views expressed on this post are mine alone and do not reflect the views of my employer, Microsoft.

Text-to-SQL is the task of automatically translating a user’s natural language query into SQL. It’s the project I’m working on at Microsoft.

If this problem is solved, it’s going to be widely useful because the vast majority of data in our lives is stored in relational databases. In fact, the healthcare, financial services, and sales industries, which can’t afford to lose transactions, rely almost exclusively on relational databases. (You can risk losing social media comments here and there, but you don’t want to risk losing the transaction records of your credit card.) At the same time, writing SQL queries can be prohibitive for non-technical users. Bill Gates noticed this problem and he himself(!) wrote down 105 questions (my team is working on 70 of them) that he wants a machine to be able to answer given enterprise databases. If we could enable people to directly interact with large-scale enterprise databases using natural language or voice, it would be very useful.

For the past two years, we have witnessed major advances in NLP downstream tasks thanks to contextualized embeddings pre-trained on web-scale data and transfer learning. This progress made me hopeful that we should be able to solve this relatively easy (?) problem. I mean, translation should be easier than a chatbot, shouldn’t it?

1. Why a Relational Database?

(when MongoDB is ‘webscale’)

The relational database has stood the test of time.

This YouTube video is one of my all-time favorite tech videos, and it captures the reasons why I’m working on Text-to-SQL, not Text-to-NoSQL. The relational database was blamed for its lack of scalability about 10 years ago, and NoSQL came into the market as a promising solution. However, we’ve learned over time that NoSQL can be the wrong tool for many application use cases. Meanwhile, the relational database has evolved to scale horizontally and serve data with low latency while meeting availability requirements. Examples of these NewSQL systems include Azure Data Warehouse, Google Spanner, MemSQL, and CockroachDB.

I’m confident that the relational database is here to stay.

2. Define the problem

Imagine a superstar recruiter, who can sell any job to anyone and also has extensive networks. But he has one drawback — he doesn’t know how to query the SQL database. He has experience and insights that he can draw from the data, but he just doesn’t have enough time to learn SQL.

Problem: Given a database (table), translate a human’s NL query into SQL.

There are thousands of different types of SQL queries a user can ask, but we are focusing only on the following cases, which we believe cover most enterprise use cases:


a. One WHERE condition

Q: What is Mark Zuckerberg's salary?
SQL: SELECT Base Salary
FROM Compensation
WHERE Name = 'Mark Zuckerberg'

To successfully translate “What is Mark Zuckerberg’s salary?” into SQL, the model needs to know that it should look up “Mark Zuckerberg” in the ‘Name’ column. Also, it should infer that “salary” means the ‘Base Salary’ column. Finally, the model should be able to construct the full SQL by correctly classifying those two columns as SELECT and WHERE, not the other way around.

b. Multiple WHERE conditions

Q: who makes less than 100k base in software engineering dept?
SQL: SELECT Name
FROM Compensation
WHERE Department = 'Software Engineering' AND Base Salary < 100000

The model should match “who” with the ‘Name’ column, “software engineering” with the ‘Department’ column and “make … base” with the ‘Base Salary’ column. It also needs to predict “<” operator for “less than 100k”.

c. With an aggregation

Q: "total stock award given out to ppl in data science hired after 2000?"
SQL: SELECT SUM(Stock Awards)
FROM Compensation
WHERE Department = 'Data Science' AND Hiring Date > '2000-01-01'

The model needs to predict the aggregator SUM and also convert “hired after 2000” into ‘Hiring Date > 2000-01-01’.


d. With a JOIN

Q: "What is Mark Zuckerberg's salary?"
SQL: SELECT Base Salary
FROM Compensation c
JOIN Employees e ON c.EmployeeId = e.EmployeeId
WHERE e.FirstName = 'Mark' AND e.LastName = 'Zuckerberg'

The example table ‘Compensation’ has employees’ names and salary numbers all in one place. However, that’s rarely the case in real life. The table will be divided into subject-based tables, ‘Employee’ and ‘Compensation’, in order to reduce redundant columns. And it’s our job to join them using EmployeeId. This makes the problem much harder.

How do we build a model that knows all of this?

3. Solution (Research)

In research settings, Text-to-SQL falls under semantic parsing — the task of converting natural language into a logical form.

One way to track the progress of research is through benchmarks. WikiSQL is one of the most popular benchmarks in semantic parsing. It is a supervised text-to-SQL dataset, beautifully hand-annotated by Amazon Mechanical Turk workers.

The state-of-the-art models (as of May 2020) on the WikiSQL leaderboard frame SQL generation as a classification problem over six SQL components. Some of the early work on WikiSQL modeled this as a sequence generation problem using seq2seq, but the field has moved away from that approach.

1. SELECT column
2. AGGREGATION: None, COUNT, SUM, AVG, MIN, MAX
3. Number of WHERE conditions: 0, 1, 2, 3
4. WHERE column
5. WHERE operator: =, <, >
6. WHERE value: a text span in the NL question (the start and end positions of the span, predicted with probabilities). It depends on both the selected WHERE column and the WHERE operator.

Let’s say you are building this classification model from scratch. What are the challenges you need to tackle?

  • The same question can be asked in many different ways.

For example:
“What is Mark Zuckerberg’s salary?”
“How much does Mark Zuckerberg make?”
“mark zuckerburg base salaray” (with typos, no capitalization)

Solution: Use contextualized (language-model pre-trained) word representations such as BERT, RoBERTa, XLNet, etc.

  • We need to match natural language with database schema (column names).

Solution: Use attention. Feed the model not only the natural language query but also the table’s column names, and take advantage of the attention mechanism.


Attention helps the model focus on the query words (or subwords) that are most relevant to each label during the training.
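As a toy sketch of this alignment, here is scaled dot-product attention between one column name and the question tokens. The 2-d embeddings and the `attention_weights` helper are made up for illustration; real models use the encoder's high-dimensional hidden states.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(column_vec, question_vecs):
    """Scaled dot-product attention of one column name over the question tokens."""
    d = len(column_vec)
    scores = [sum(c * q for c, q in zip(column_vec, qv)) / math.sqrt(d)
              for qv in question_vecs]
    return softmax(scores)

# Toy 2-d embeddings (made up): the token "salary" should receive most of
# the attention mass for the 'Base Salary' column vector.
question_tokens = ["what", "is", "salary"]
question_vecs = [[0.1, 0.0], [0.0, 0.1], [0.9, 0.2]]
base_salary_col = [1.0, 0.3]
weights = attention_weights(base_salary_col, question_vecs)
```

With these toy vectors, the weight for "salary" dominates, which is exactly the alignment behavior we want the model to learn.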

  • We are not only predicting WHERE columns but also WHERE operators.

Solution: Use execution-guided decoding. Notice that certain operators can’t be used with certain column types. For example, you can’t sum over a column of string type, and you can’t apply > or < to a string column either. Execution-guided decoding detects and excludes such faulty SQL during decoding. Another thing we can try is to feed the column type encoding as an input as well.
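A minimal sketch of the type-constraint idea, assuming we know each column's type. The names and rules are illustrative and follow the simplification above (numeric aggregators and inequality operators never apply to string columns):

```python
# Illustrative type constraints, following the article's simplification.
NUMERIC_ONLY_AGGS = {"SUM", "AVG", "MIN", "MAX"}
NUMERIC_ONLY_OPS = {"<", ">"}

def is_valid(agg, op, column_type):
    """Return False for (aggregator, operator) pairs incompatible with the column type."""
    if column_type != "number":
        if agg in NUMERIC_ONLY_AGGS or op in NUMERIC_ONLY_OPS:
            return False
    return True

candidates = [
    ("SUM", "=", "string"),    # can't sum a string column -> pruned
    ("None", ">", "string"),   # '>' on a string column    -> pruned
    ("SUM", ">", "number"),    # fine                      -> kept
]
valid = [c for c in candidates if is_valid(*c)]
```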

A common architecture in recent research

The top three models on WikiSQL leaderboard as of May 2020 — HydraNet, X-SQL, SQLova — share a similar architecture.


Typical architecture of Text-to-SQL models

  • Input: NL question + column names.

We design the architecture so that it accepts the concatenation as an input. Some models take type embeddings or positional embeddings as inputs as well. Column names can consist of several words.
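A rough sketch of this concatenation, using whitespace tokenization instead of a real BPE/WordPiece tokenizer. The `pack_input` helper and marker strings are illustrative stand-ins for the encoder's own special tokens:

```python
# Illustrative input packing: question and column names in one token sequence.
def pack_input(question, columns, cls="[CLS]", sep="[SEP]"):
    tokens = [cls] + question.lower().split()
    for col in columns:                 # column names may span several words
        tokens += [sep] + col.lower().split()
    tokens.append(sep)
    return tokens

tokens = pack_input("What is Mark Zuckerberg's salary?",
                    ["Name", "Department", "Base Salary"])
```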

  • 1st Layer: Encode the input with the contextualized word (or BPE) representation.

There are many such off-the-shelf encoders: BERT, RoBERTa, XLNet, ELMo, etc. (Earlier models used static embeddings such as GloVe.)

  • 2nd Layer: Strengthen the encoder output with bi-LSTM or self-attention.

The goal of this step is to capture a larger context of the input. A bi-LSTM or self-attention enables the model to access information about prior, current, and future inputs. The resulting embeddings don’t merely capture the local context of a word — they can, in principle, capture the entire semantics of the input.

  • 3rd Layer: Six classification modules

The final layer consists of classification modules over six different components of the final SQL. Each module has its own attention weights to align the hidden representations with the final label.

1. SELECT column
2. AGGREGATION: None, COUNT, SUM, AVG, MIN, MAX
3. Number of WHERE conditions: 0, 1, 2, 3
4. WHERE column
5. WHERE operator: =, <, >
6. WHERE value: a text span in the NL question (the start and end positions of the span, predicted with probabilities). It depends on both the selected WHERE column and the WHERE operator.

During training, we minimize the loss — the sum of the six individual sub-task losses, computed with cross-entropy or Kullback–Leibler divergence.
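In pure Python, the multi-task objective looks roughly like this. The logits and gold label indices are made up for illustration; a real model would produce them from the encoder's hidden states:

```python
import math

def cross_entropy(logits, gold):
    """Softmax cross-entropy loss for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]

# Made-up logits and gold labels for the six heads of one training example.
head_outputs = {
    "select_column":  ([2.0, 0.1, -1.0], 0),
    "aggregation":    ([0.5, 3.0, 0.0, 0.0, 0.0, 0.0], 1),
    "where_number":   ([0.0, 2.5, 0.2, -1.0], 1),
    "where_column":   ([-0.5, 1.8, 0.3], 1),
    "where_operator": ([2.2, 0.1, 0.1], 0),
    "where_value":    ([0.9, 0.4, 0.1], 0),   # start position; end is analogous
}
total_loss = sum(cross_entropy(logits, gold) for logits, gold in head_outputs.values())
```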

Inference is pretty straightforward: the highest-scoring label is returned from each module. If the highest-scoring WHERE column is [None], we ignore the output of the WHERE-number predictor and return zero WHERE conditions.

The three models share a similar architecture but differ in the details. For example, while SQLova relies on LSTMs (both the question and the table schema are encoded through an RNN), X-SQL removed LSTMs entirely and went self-attention all the way. As encoders, SQLova uses BERT, X-SQL uses MT-DNN, and HydraNet uses RoBERTa.

4. How is the real-life (production) different from research?

Why can’t many research techniques be directly applied to production?

  • Typical enterprise database queries require multiple tables with JOIN, not just a single table.

While WikiSQL is an exceptionally well-made dataset, it has one pitfall: it assumes NO JOIN. It takes for granted that a single table will have all the columns you need for a query. In real life, however, you will need a JOIN because enterprise databases are normalized into subject-based tables.

  • Mismatch between NL and DB values

Research datasets assume the values in a natural language question (e.g. how much did we pay Zuck last month?) appear exactly the same way (last month) in the database. In production, the natural language ‘last month’ is stored as ‘2020-02-01’ in the database, ‘10k’ is stored as ‘10,000’, and ‘NY’ as ‘New York’. The model should know how to match this NL shorthand with the values in the DB.
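A naive rule-based normalizer for the examples above might look like this. The `today` anchor and the abbreviation table are illustrative assumptions; a production system needs far broader coverage:

```python
import re
from datetime import date

# Hypothetical normalizer mapping NL shorthand to DB-style values.
ABBREVIATIONS = {"NY": "New York", "SF": "San Francisco"}

def normalize_value(text, today=date(2020, 3, 15)):
    text = text.strip()
    if text == "last month":                     # relative date -> first of previous month
        prev = today.month - 1 or 12
        year = today.year if today.month > 1 else today.year - 1
        return f"{year:04d}-{prev:02d}-01"
    m = re.fullmatch(r"(\d+(?:\.\d+)?)k", text, re.IGNORECASE)   # '10k' -> '10000'
    if m:
        return str(int(float(m.group(1)) * 1000))
    return ABBREVIATIONS.get(text, text)         # 'NY' -> 'New York'
```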

  • Interdependencies between columns

For example, when ‘Employment Status’ is “Not active”, the ‘Last Salary’ column should be returned, not ‘Current Salary’.

  • Similar column names in enterprise databases, always!

For example, “EstimatedRevenue” vs “ActualRevenue”, “CreatedOn” vs “ModifiedOn”, “CreatedOn” vs “CreatedBy”, ‘LastSalary’ vs ‘Salary’, etc.

  • The sheer difference in database size

A table in the WikiSQL dataset usually has 5–7 columns and about 10 rows. Meanwhile, a typical enterprise table has 30–40 columns and millions of rows.

5. Solution (Production)

So how do we solve the challenges of Text-to-SQL in production?

Surprisingly, it’s not a more complex model. It’s the training data that matters most to the performance of a DL model.

The main benefit of deep learning over conventional machine learning is that deep learning learns from the data with little or no feature engineering. With more high-quality training data, the model almost always performs better. In other words, DL models are often constrained by the availability of large-scale, high-quality training data.

The details of generating training data are out of scope for this article (I’ll talk about data generation/augmentation in a separate post), but here are the techniques I’m using to generate accurate & cost-effective training data:

  • Weak supervision
  • Paraphrases obtained from back translation
  • Adversarial examples/training
  • Interpolation/Extrapolation of nearest neighbor words
  • Iterative data cleaning
  • Random word/character insertion, swap, deletion
  • Template slot filling (with DB values)
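The last technique, template slot filling, can be sketched as follows. The templates and sampled values here are toy examples; a real pipeline uses many more templates and samples values directly from the client database:

```python
import random

# Hypothetical templates; slots are filled with values sampled from the DB.
TEMPLATES = [
    ("what is {name}'s salary?",
     "SELECT BaseSalary FROM Compensation WHERE Name = '{name}'"),
    ("who makes less than {amount} in {dept}?",
     "SELECT Name FROM Compensation WHERE Department = '{dept}' AND BaseSalary < {amount}"),
]

db_values = {"name": ["Mark Zuckerberg", "Ada Lovelace"],
             "dept": ["Software Engineering", "Data Science"],
             "amount": [100000, 150000]}

def generate_pairs(n, seed=0):
    """Generate n (question, SQL) training pairs by filling template slots."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        q_tpl, sql_tpl = rng.choice(TEMPLATES)
        slots = {k: rng.choice(v) for k, v in db_values.items()}
        pairs.append((q_tpl.format(**slots), sql_tpl.format(**slots)))
    return pairs
```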

CDM (Common Data Model) is a standard & extensible schema (entities, attributes, relationships) that represents concepts frequently used in enterprises. CDM exists to facilitate data interoperability. For example, every company, regardless of its industry, will have some kind of customer management database with tables such as Account, Contact, Product, etc. Clients can name their columns differently as long as they map to CDM, and as long as the database is CDM compliant, we can easily fine-tune the pre-trained Text-to-SQL model.

In production, we have access to the cell contents of the database. (If we did this in research, it would be cheating.) We build an index on the cell values so that the model can identify the values mentioned in the NL query and find the most appropriate columns.
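A simplistic version of such an index might look like this. The helper names and exact-match lookup are illustrative; a production system would use a proper search index with fuzzy matching:

```python
from collections import defaultdict

def build_value_index(tables):
    """Map each cell value (lowercased) to the (table, column) pairs containing it."""
    index = defaultdict(set)
    for table, columns in tables.items():
        for column, values in columns.items():
            for v in values:
                index[str(v).lower()].add((table, column))
    return index

def link_question(question, index):
    """Naive linker: find indexed values mentioned verbatim in the question."""
    q = question.lower()
    return {value: cols for value, cols in index.items() if value in q}

tables = {"Compensation": {"Name": ["Mark Zuckerberg"],
                           "Department": ["Software Engineering"]}}
index = build_value_index(tables)
links = link_question("What is Mark Zuckerberg's salary?", index)
```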

The key idea of execution-guided decoding is that a partially generated SQL can be executed and the results of that execution can be used to guide the rest of SQL generation. For example, the execution-guided decoder would eliminate an incorrect string-to-string inequality comparison “… WHERE department > ‘software engineering’ …” from the beam immediately after the token ‘software engineering’ is returned. Excluding erroneous candidates and not proceeding further during the beam search increases the accuracy of the model.
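Here is a toy demonstration of that pruning with Python's built-in sqlite3. The table, rows, and beam candidates are fabricated; real systems execute partial queries against the actual database, and this sketch prunes both erroring candidates and (as with the string inequality above) candidates that return no rows:

```python
import sqlite3

# Fabricated toy table to demonstrate execution-guided pruning.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Compensation (Name TEXT, Department TEXT, BaseSalary INTEGER)")
conn.execute("INSERT INTO Compensation VALUES ('Mark Zuckerberg', 'Software Engineering', 1)")

def survives(candidate_sql):
    """Keep a beam candidate only if it executes and returns at least one row."""
    try:
        return len(conn.execute(candidate_sql).fetchall()) > 0
    except sqlite3.Error:
        return False

beam = [
    "SELECT Name FROM Compensation WHERE Department > 'software engineering'",  # empty result -> pruned
    "SELECT Name FROM Compensation WHERE Department = 'Software Engineering'",  # kept
    "SELECT Name FROM Compensation WHERE BadColumn = 1",                        # execution error -> pruned
]
kept = [sql for sql in beam if survives(sql)]
```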

Personally, I find translating natural language into executable SQL a fascinating problem to work on, and I see this project as more than a moonshot. Text-to-SQL has the potential to be solved, and once it is, it will be widely applicable given how extensively relational databases are used in every sector. Yet surprisingly few labs are working on this problem. As far as I know, the only teams actively working on Text-to-SQL are my team at Microsoft, Salesforce, and Naver (a Korean search engine). If you are reading this and working on the Text-to-SQL problem, let me know in the comments!

Written by

I’m an Engineering Manager at Scale AI and this is my notepad for Applied Math / CS / Deep Learning topics. Follow me on Twitter for more!
