How to model a natural language interface for relational databases

The views expressed in this post are mine alone and do not reflect the views of my employer, Microsoft.

Text-to-SQL is the task of automatically translating a user’s natural language question into SQL. It is the project that I’m working on at Microsoft.

If this problem is solved, it’s going to be widely useful because the vast majority of data in our lives is stored in relational databases. In fact, the healthcare, financial services, and sales industries rely almost exclusively on relational databases. In other words, the industries that can’t afford to lose transactions are the ones that depend on the relational database. (You can risk losing social media comments here and there, but you don’t want to risk losing the transaction records of your credit card.) Also, writing SQL queries can be prohibitive for non-technical users. Bill Gates noticed this problem and he himself(!) wrote down 105 questions (my team is working on 70 of them) that he wants a machine to be able to answer given enterprise databases. …


[No more confusion] How to find a p-value and ultimately reject the null hypothesis

If you read any scientific paper (medical, artificial intelligence, climate, political, and so on) or any poll result, there is a term that almost always appears: the p-value.

But what exactly is a p-value? Why does it show up in all these contexts?

This table lists the symptoms and their p-values when you are infected with the novel coronavirus (COVID-19).

From one of the most cited COVID-19 papers — Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus Infected Pneumonia in Wuhan, China

The only remarks about this table from the authors were “Proportions for categorical variables were compared using the χ2 test. P values indicate differences between ICU (Intensive Care Unit) and non-ICU patients.”
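To make that remark concrete, here is a minimal sketch of a χ2 test comparing one proportion between two groups. The counts are made up for illustration and are NOT taken from the paper, and the function name chi2_2x2 is my own. It uses the plain Pearson statistic without a continuity correction, so a statistics package may report a slightly different p-value.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Layout:            ICU   non-ICU
      symptom present   a      b
      symptom absent    c      d
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, for illustration only.
stat, p_value = chi2_2x2(20, 30, 10, 40)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value here says that a gap this large between the two groups would be unlikely if the symptom were equally common in both.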

Let’s say all doctors in the hospital fell sick from the coronavirus and you(!) are in charge of triaging the patients who need to go to ICU. There are only a limited number of beds in the ICU so you can’t just admit everyone. The only data that you can refer to is this table. (Please note that there will be some other factors that physicians will consider when it comes to ICU admission. This is for a pedagogical illustration only.) …


Scope, Examples, and Careers

The (somewhat vague) term “Operations Research” was coined during World War II. The British military brought together a group of scientists to allocate scarce resources (for example, food, medics, weapons, and troops) to different military operations in the most effective way possible. So the “operations” in the name comes from “military operations”. Conducting military operations successfully was a huge deal, and Operations Research (OR) became its own academic discipline in universities in the 1940s.

Wikipedia page of Operations Research

When you google “Operations Research”, you get a very long Wikipedia article. However, the explanation is a little bit all over the place and, to be honest, outdated as well. …


When to use Beta distribution

The Beta distribution is a probability distribution on probabilities. For example, we can use it to model probabilities such as the Click-Through Rate of your advertisement, the conversion rate of customers actually purchasing on your website, how likely readers are to clap for your blog, how likely it is that Trump will win a second term, the 5-year survival chance for women with breast cancer, and so on.

Because the Beta distribution models a probability, its domain is bounded between 0 and 1.

1. Why does the PDF of Beta distribution look the way it does?

An excerpt from Wikipedia

What’s the intuition?

Let’s ignore the coefficient 1/B(α,β) for a moment and only look at the numerator x^(α-1) * (1-x)^(β-1), because 1/B(α,β) is just a normalizing constant to make the function integrate to 1. …
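As a quick numerical check of that claim, here is a small sketch using only the Python standard library. The parameter values α = 2, β = 5 and the helper names are my own choices for illustration. It integrates the numerator alone, then the full PDF, over (0, 1).

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1)."""
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b) only rescales the curve.
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

def integrate(f, n=100_000):
    """Midpoint rule on the interval (0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

a, b = 2.0, 5.0  # hypothetical parameters
numerator_area = integrate(lambda x: x ** (a - 1) * (1 - x) ** (b - 1))
pdf_area = integrate(lambda x: beta_pdf(x, a, b))

print(numerator_area)  # close to B(2, 5) = 1/30
print(pdf_area)        # close to 1
```

The numerator carries the whole shape of the curve, and dividing by B(α,β) just rescales its height so the area becomes 1.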


With examples & proofs

1. What is Prior?

Prior probability is the probability of an event before we see the data.
In Bayesian Inference, the prior is our guess about the probability based on what we know now, before new data becomes available.

2. What is Conjugate Prior?

A conjugate prior simply cannot be understood without knowing Bayesian inference.

For the rest of the blog, I’ll assume you know the concepts of prior, sampling and posterior.

Conjugate prior in essence

For some likelihood functions, if you choose a certain prior, the posterior ends up being in the same distribution family as the prior. Such a prior is then called a Conjugate Prior.

It is always best understood through examples. Below is the code to calculate the posterior of the binomial likelihood. θ is the probability of success and our goal is to pick the θ that maximizes the posterior probability.
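A minimal sketch of such a calculation follows, under assumptions of my own: a Beta(2, 2) prior and made-up data of 7 successes in 10 trials. It evaluates prior times likelihood on a grid of θ values, normalizes the result, and compares it with Beta(α + k, β + n − k), which is what conjugacy predicts for a Beta prior and a binomial likelihood.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in [0, 1] (assumes a, b >= 1)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

# Assumptions for illustration only.
a, b = 2.0, 2.0   # Beta prior parameters
k, n = 7, 10      # observed successes and trials

step = 0.001
grid = [i / 1000 for i in range(1001)]  # candidate values of theta

# Unnormalized posterior: binomial likelihood times prior.
unnorm = [math.comb(n, k) * t ** k * (1 - t) ** (n - k) * beta_pdf(t, a, b)
          for t in grid]
area = sum(unnorm) * step  # endpoints are 0, so this is the trapezoid rule
posterior = [u / area for u in unnorm]

# Conjugacy: the posterior should match Beta(a + k, b + n - k).
closed_form = [beta_pdf(t, a + k, b + n - k) for t in grid]
max_gap = max(abs(p - q) for p, q in zip(posterior, closed_form))

# The theta that maximizes the posterior (the MAP estimate).
theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(max_gap, theta_map)
```

The grid posterior and the closed-form Beta density agree up to numerical error, so no integration was needed to find the posterior: updating the two Beta parameters is enough.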


with Python Code

Why did someone have to invent Bayesian inference?

In one sentence: to update the probability as we gather more data.

The core of Bayesian inference is to combine two different distributions (likelihood and prior) into one “smarter” distribution (posterior). The posterior is “smarter” in the sense that classic maximum likelihood estimation (MLE) doesn’t take a prior into account. Once we calculate the posterior, we use it to find the “best” parameters, where “best” means maximizing the posterior probability given the data. This process is called Maximum A Posteriori (MAP) estimation. …
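To see the difference between MLE and MAP in numbers, here is a small sketch for a coin-flip style problem. The Beta(2, 2) prior, the data, and the function names are assumptions of mine for illustration. With a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a + k, b + n − k), and its mode gives the MAP estimate.

```python
def mle(k, n):
    """Maximum likelihood estimate of the success probability."""
    return k / n

def map_estimate(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior.

    Valid when a + k > 1 and b + n - k > 1.
    """
    return (k + a - 1) / (n + a + b - 2)

a, b = 2.0, 2.0  # hypothetical prior: mild belief that theta is near 0.5

# Small sample: 3 successes in 3 trials.
print(mle(3, 3), map_estimate(3, 3, a, b))

# Large sample: 300 successes in 300 trials.
print(mle(300, 300), map_estimate(300, 300, a, b))
```

With only three observations the prior pulls the estimate away from the extreme value of 1.0. With three hundred, the data dominates and the two estimates nearly coincide.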


Its properties, proofs & graphs

Why should I care?

Many probability distributions are defined using the Gamma function, such as the Gamma distribution, Beta distribution, Dirichlet distribution, Chi-squared distribution, and Student’s t-distribution.
For data scientists, machine learning engineers, and researchers, the Gamma function is probably one of the most widely used functions because it is employed in many distributions. These distributions are then used for Bayesian inference, stochastic processes (such as queueing models), generative statistical models (such as Latent Dirichlet Allocation), and variational inference. …


and why does it matter?

Before setting Gamma’s two parameters α, β and plugging them into the formula, let’s pause for a moment and ask a few questions…

Why did we have to invent the Gamma distribution? (i.e., why does this distribution exist?)

When should Gamma distribution be used for modeling?

1. Why did we invent Gamma distribution?

Answer: To predict the wait time until future events.

Hmmm ok, but I thought that’s what the exponential distribution is for.
Then, what’s the difference between exponential distribution and gamma distribution?

The exponential distribution predicts the wait time until the *very first* event. …
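A small simulation can make the contrast concrete. This is a sketch under my own assumptions: events arrive independently at a made-up rate of 2 per unit of time, and we compare the wait until the first event with the wait until the 3rd event. The wait until the k-th event is a sum of k independent exponential waits, which follows a Gamma distribution with shape k and scale 1/rate.

```python
import random
from statistics import fmean

random.seed(0)
rate = 2.0        # hypothetical: 2 events per unit of time on average
k = 3             # wait for the 3rd event
trials = 100_000

# Wait until the very first event: exponential.
first_waits = [random.expovariate(rate) for _ in range(trials)]

# Wait until the k-th event: sum of k independent exponential waits.
kth_waits = [sum(random.expovariate(rate) for _ in range(k))
             for _ in range(trials)]

# Direct Gamma draws (gammavariate takes shape and scale = 1 / rate).
gamma_draws = [random.gammavariate(k, 1 / rate) for _ in range(trials)]

mean_first = fmean(first_waits)   # theory: 1 / rate
mean_kth = fmean(kth_waits)       # theory: k / rate
mean_gamma = fmean(gamma_draws)   # theory: k / rate
print(mean_first, mean_kth, mean_gamma)
```

The summed exponential waits and the direct Gamma draws have matching averages, while the first-event wait is k times shorter on average.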


Conditional Independence Intuition, Derivation, and Examples

Sometimes the explanation in Wikipedia is not the easiest to understand.

From https://en.wikipedia.org/wiki/Conditional_independence

1. The intuition of Conditional Independence

Let’s say A is the height of a child and B is the number of words that the child knows. It seems when A is high, B is high too.

There is a single piece of information that will make A and B completely independent. What would that be?

The child’s age.

The height and the number of words known by the kid are NOT independent, but they are conditionally independent if you provide the kid’s age.
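Here is a toy simulation of this intuition. Everything in it is made up for illustration: the age range, the growth and vocabulary formulas, and the noise levels. Height and vocabulary each depend on age plus their own independent noise, so they are correlated overall but uncorrelated once age is fixed.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
by_age = {}
for age in range(2, 11):                               # hypothetical ages 2..10
    rows = []
    for _ in range(3000):
        height = 80 + 6 * age + random.gauss(0, 3)     # made-up growth model (cm)
        words = 500 * age + random.gauss(0, 300)       # made-up vocabulary model
        rows.append((height, words))
    by_age[age] = rows

all_rows = [row for rows in by_age.values() for row in rows]
overall = pearson([h for h, _ in all_rows], [w for _, w in all_rows])
within = {age: pearson([h for h, _ in rows], [w for _, w in rows])
          for age, rows in by_age.items()}

print(f"overall correlation: {overall:.2f}")
for age, r in within.items():
    print(f"age {age}: correlation {r:+.2f}")
```

Zero correlation within an age group is only a symptom of conditional independence, not a proof of it, but it is the easiest symptom to check in a simulation.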

2. Mathematical Form


Its examples and properties

If you have Googled “Moment Generating Function” and the first, the second, and the third results haven’t had you nodding yet, then give this article a try.

1. First things first — What is the “Moment” in probability/statistics?

Let’s say the random variable we are interested in is X.

The moments are the expected values of the powers of X: E(X), E(X²), E(X³), and so on.

The first moment is E(X),

The second moment is E(X²),

The third moment is E(X³),

The n-th moment is E(X^n).

We are pretty familiar with the first two moments: the mean μ = E(X) and the second moment E(X²), which gives the variance E(X²) − μ². They are important characteristics of X. The mean is the average value and the variance is how spread out the distribution is. But there must be other features that also define the distribution. For example, the third moment is about the asymmetry of a distribution. …
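As a worked example, here is a short sketch that computes moments exactly for a fair six-sided die. The die is my own choice of example, and the helper names are hypothetical. For the asymmetry check it uses the third central moment E((X − μ)³), which is zero for a symmetric distribution like this one.

```python
from fractions import Fraction

faces = range(1, 7)        # fair six-sided die
prob = Fraction(1, 6)      # each face is equally likely

def raw_moment(n):
    """n-th moment E(X^n)."""
    return sum(prob * x ** n for x in faces)

def central_moment(n):
    """n-th central moment E((X - mu)^n)."""
    mu = raw_moment(1)
    return sum(prob * (x - mu) ** n for x in faces)

mean = raw_moment(1)                   # E(X)
second = raw_moment(2)                 # E(X^2)
variance = second - mean ** 2          # E(X^2) - mu^2
third_central = central_moment(3)      # 0 for a symmetric distribution

print(mean, second, variance, third_central)
```

Using Fraction keeps the arithmetic exact, so the variance comes out as a clean fraction rather than a rounded float.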

About

Aerin Kim

I’m an Engineering Manager at Scale AI and this is my notepad for Applied Math / CS / Deep Learning topics. Follow me on Twitter for more!
