*The views expressed on this post are mine alone and do not reflect the views of my employer, Microsoft.*

**Text-to-SQL** is the task of automatically translating a user’s question, posed in natural language, into SQL. It is the project I’m working on at Microsoft.

If this problem is solved, it will be widely useful, because the vast majority of data in our lives is stored in relational databases. In fact, the **healthcare**, **financial services**, and **sales** industries rely almost exclusively on relational databases; these are the industries that can’t afford to lose transactions. (You can risk losing a few social media comments here and there, but you don’t want to risk losing the transaction records of your credit card.) Writing SQL queries can also be prohibitive for non-technical users. Bill Gates noticed this problem and himself(!) wrote down 105 questions (my team is working on 70 of them) that he wants a machine to be able to answer given enterprise databases. …

If you read any scientific papers, e.g. medical, artificial intelligence, climate, political, etc., or any poll result, there is a term that almost always appears — the p-value.

But what exactly is a p-value? Why does it show up in all these contexts?

This table lists the symptoms of patients infected with the novel coronavirus (COVID-19), along with the p-value for each symptom.

The only remarks about this table from the author were *“Proportions for categorical variables were compared using the χ2 test. P values indicate differences between ICU (Intensive Care Unit) and non-ICU patients.”*

Let’s say all doctors in the hospital fell sick from the coronavirus and you(!) are in charge of triaging the patients who need to go to ICU. There are only **a limited number of beds in the ICU** so you can’t just admit everyone. The only data that you can refer to is this table. (Please note that there will be some other factors that physicians will consider when it comes to ICU admission. This is for a pedagogical illustration only.) …
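The χ2 test mentioned in the author’s remark compares the proportion of each symptom between ICU and non-ICU patients. As a minimal sketch of how such a p-value is produced, here is a chi-squared test on a 2×2 contingency table using `scipy.stats.chi2_contingency` — the counts below are made up for illustration, not taken from the paper:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: counts of patients with/without one symptom,
# split by ICU vs. non-ICU admission (numbers invented for illustration).
#                  symptom  no symptom
contingency = [[30, 6],    # ICU
               [60, 44]]   # non-ICU

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}, dof = {dof}")
```

A small p-value here would indicate that the symptom’s frequency differs between the ICU and non-ICU groups more than chance alone would explain.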

The (somewhat vague) term “Operations Research” was coined during World War II. The British military brought together a group of scientists to allocate insufficient resources — for example, food, medics, weapons, troops, etc. — in the most effective way possible across different military **operations**. So the term “*operations*” comes from “*military operations*”. Successfully conducting military operations was a huge deal, and Operations Research (OR) became its own academic discipline in universities in the 1940s.

When you google “Operations Research”, you get a very long Wikipedia article. However, the explanation is a bit all over the place and, to be honest, outdated as well. …

The Beta distribution is **a probability distribution on probabilities**. For example, we can use it to model the probabilities: the Click-Through Rate of your advertisement, the conversion rate of customers actually purchasing on your website, how likely readers will clap for your blog, how likely it is that Trump will win a second term, the 5-year survival chance for women with breast cancer, and so on.

Because the Beta distribution models a probability, its domain is bounded between **0** and **1**.

**Let’s ignore the coefficient 1/B(α,β)** for a moment and only look at the numerator **x^(α-1) * (1-x)^(β-1)**, because **1/B(α,β)** is just a normalizing constant that makes the function integrate to 1. …
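To see that the numerator alone already carries the whole shape of the distribution, here is a minimal sketch (using SciPy, with example values α = 2, β = 5) showing that dividing the numerator by B(α, β) reproduces the Beta PDF exactly:

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.special import beta as B  # the Beta function B(α, β)

a, b = 2.0, 5.0                      # example values for α and β
x = np.linspace(0.01, 0.99, 99)

# The numerator alone: x^(α-1) * (1-x)^(β-1). This determines the shape.
numerator = x**(a - 1) * (1 - x)**(b - 1)

# Dividing by the normalizing constant B(α, β) recovers the full PDF.
pdf = numerator / B(a, b)

# It matches SciPy's Beta PDF everywhere on (0, 1).
assert np.allclose(pdf, beta_dist.pdf(x, a, b))
```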

Prior probability is **the probability of an event before we see the data**.

In Bayesian Inference, the prior is our guess about the probability based on what we know now, before new data becomes available.

The conjugate prior simply cannot be understood without knowing Bayesian inference.

For the rest of the blog, I’ll assume you know the concepts of prior, sampling and posterior.

**For some likelihood functions, if you choose a certain prior,** the posterior ends up in the same distribution family as the prior. Such a prior is then called a **conjugate prior**.

This is always best understood through examples. Below is the code to **calculate the posterior of the binomial likelihood**. **θ** is the probability of success, and our goal is to pick the **θ that maximizes the posterior probability**. …
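The original code snippet is truncated here; as a minimal sketch of what such a posterior computation might look like, the following evaluates the binomial likelihood on a grid of θ values under a uniform prior (the counts n and k are made-up numbers for illustration):

```python
import numpy as np

# Grid of candidate values for θ, the probability of success.
theta = np.linspace(0.001, 0.999, 999)

# Observed data: k successes out of n Bernoulli trials (made-up numbers).
n, k = 10, 7

# Binomial likelihood (up to a constant) and a uniform prior on θ.
likelihood = theta**k * (1 - theta)**(n - k)
prior = np.ones_like(theta)

# Unnormalized posterior; normalize so it sums to 1 over the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

# The θ that maximizes the posterior probability.
theta_map = theta[np.argmax(posterior)]
print(theta_map)  # close to k/n = 0.7 under a uniform prior
```

With a uniform prior the posterior is proportional to the likelihood, so the maximizing θ lands at k/n; a non-flat prior would pull it away from that value.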

In one sentence: to **update the probability as we gather more data**.

The core of Bayesian inference is to combine two different distributions (likelihood and prior) into one “smarter” distribution (posterior). The posterior is **“smarter” in the sense that the classic maximum likelihood estimation (MLE) doesn’t take a prior into account**. Once we calculate the posterior, we use it to find the “best” parameters, where **“best” means maximizing the posterior probability**, given the data. This process is called **Maximum A Posteriori (MAP)**. …

**Why should I care?**

**Many probability distributions are defined using the Gamma function** — the Gamma, Beta, Dirichlet, Chi-squared, and Student’s t-distributions, among others. For data scientists, machine learning engineers, and researchers, the Gamma function is probably …

Before setting Gamma’s two parameters *α* and *β* and plugging them into the formula, let’s pause for a moment and ask a few questions…

Why did we have to invent the Gamma distribution? (i.e., why does this distribution exist?)

When should Gamma distribution be used for modeling?

**Answer: To predict the wait time until future events.**

Hmmm ok, but I thought that’s what the exponential distribution is for.

Then, what’s the difference between the exponential distribution and the gamma distribution?

The exponential distribution predicts the wait time until the ***very first*** event. …
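This relationship is easy to check by simulation: the wait until the first event is exponential, and the wait until the k-th event, being a sum of k independent exponential waits, follows a Gamma distribution. A minimal sketch with NumPy (the rate λ and k below are made-up example values):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0       # λ: events per unit time (example value)
k = 3            # we wait for the k-th event

# Wait until the very first event: exponential distribution, mean 1/λ.
first_event = rng.exponential(scale=1/rate, size=100_000)

# Wait until the k-th event: sum of k independent exponential waits,
# which is exactly a Gamma(shape=k, scale=1/λ) random variable, mean k/λ.
kth_event = rng.exponential(scale=1/rate, size=(100_000, k)).sum(axis=1)

print(first_event.mean())  # ≈ 1/λ = 0.5
print(kth_event.mean())    # ≈ k/λ = 1.5
```

Setting k = 1 collapses the Gamma wait back to the exponential one, which is why the exponential distribution is a special case of the Gamma distribution.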

Sometimes the explanation in Wikipedia is not the easiest to understand.

Let’s say **A** is **the height of a child** and **B** is **the number of words that the child knows**. It seems when **A** is high, **B** is high too.

There is **a single piece of information that will make A and B completely independent.** What would that be?

The child’s age.

The height and the # of words known by the kid are **NOT independent**, but they are **conditionally independent** if you provide the kid’s age.
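You can see this in a quick simulation. In the sketch below (all numbers are invented for illustration), age drives both height and vocabulary with independent noise on each: unconditionally the two are strongly correlated, but once you restrict to children of (approximately) the same age, the correlation collapses:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Made-up generative story: age drives both height and vocabulary,
# with independent noise on each.
age = rng.uniform(2, 10, size=n)                    # years
height = 80 + 6 * age + rng.normal(0, 3, size=n)    # cm
words = 500 * age + rng.normal(0, 300, size=n)      # words known

# Unconditionally, height and vocabulary are strongly correlated...
r_all = np.corrcoef(height, words)[0, 1]
print(r_all)   # large, close to 1

# ...but conditioned on (approximately) fixed age, the correlation vanishes.
same_age = (age > 5.9) & (age < 6.1)
r_cond = np.corrcoef(height[same_age], words[same_age])[0, 1]
print(r_cond)  # near 0
```

Conditioning on age removes the common cause, and what remains is just the two independent noise terms.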

If you have Googled “Moment Generating Function” and the first, the second, and the third results haven’t had you nodding yet, then give this article a try.

Let’s say the random variable we are interested in is **X**.

**The moments are the expected values of X, e.g., E(X), E(X²), E(X³), … etc.**

The first moment is **E(X)**,

The second moment is **E(X²)**,

The third moment is **E(X³),**

…

The n-th moment is **E(X^n)**.

We are pretty familiar with the first two moments: the mean **μ = E(X)** and the variance **E(X²) − μ²**. They are important characteristics of **X**. The mean is the average value and the variance is how spread out the distribution is. But there must be **other features** that also define the distribution. For example, the third moment is about the asymmetry of a distribution. …
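The moments are easy to estimate from samples as plain averages of powers of X. As a minimal sketch, the snippet below estimates the first three raw moments of an Exp(1) distribution (chosen here as an example of an asymmetric, right-skewed distribution), whose n-th moment is known to be n!:

```python
import numpy as np

rng = np.random.default_rng(7)

# Sample from a right-skewed distribution: Exponential with scale 1.
x = rng.exponential(scale=1.0, size=200_000)

# Raw moments E(X), E(X^2), E(X^3), estimated by sample averages.
m1 = np.mean(x)        # first moment: the mean, ≈ 1 for Exp(1)
m2 = np.mean(x**2)     # second moment, ≈ 2! = 2
m3 = np.mean(x**3)     # third moment, ≈ 3! = 6

# Variance recovered from the first two moments: E(X^2) − μ².
variance = m2 - m1**2  # ≈ 1 for Exp(1)

print(m1, m2, m3, variance)
```

The large third moment relative to a symmetric distribution with the same mean and variance is exactly the asymmetry the text refers to.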
