I don’t think we are generally doing a good job of teaching statistics to students. Because of this, many people have a spotty understanding of the field and treat it with suspicion. For them, statistics is a bunch of tricks that let you determine whether something is “significant” or not, and, if it is not, it might as well be zero. Somehow the big picture of statistics is missing from most introductory courses, and it took me a while to piece it together. What follows is my attempt to write it down for posterity (and others).

A statistician’s worldview starts with the concept of a *data generating process*,
or DGP for short.
The DGP is an unverifiable assumption of how the world works.
For an admittedly silly example, we might be interested in understanding
how a person’s height affects their wage.
A possible DGP for this problem would look like this:

\[
w = \alpha + \beta \cdot h + \varepsilon,
\]
where \(w\) is a person’s wage, \(h\) is their height, \(\varepsilon\) represents
all factors other than height that affect the wage, and \(\alpha\) and
\(\beta\) are the only two unknown parameters in this DGP equation.
I am deliberately keeping this simple: one can easily postulate vastly more
complex DGPs that attempt to relate height to wages.
For our purposes we just need to have *a* DGP, no matter how silly.

The DGP is fundamentally unfalsifiable and is often assumed to be linear, especially at the early stages of analysis. Standard arguments about local approximations via Taylor expansions are usually brought up to justify the linearity, and often fitting linear models well is hard enough. But the important aspect is that by assuming the DGP we have reduced the dimensionality of the problem dramatically: now we only have the two unknown parameter values \(\alpha\) and \(\beta\) to worry about.
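To make the “assuming a DGP” step concrete, here is a minimal sketch in Python of what it means for the world to follow the linear DGP above: we pick parameter values and let the equation generate the data. All the numbers here (the parameter values, the height and noise distributions) are made up for illustration; in reality \(\alpha\) and \(\beta\) are unknowable.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" parameter values -- unknowable in real life.
alpha, beta = 10.0, 0.2

I = 500                                  # number of people
h = rng.normal(170, 10, size=I)          # heights, in cm (made-up distribution)
eps = rng.normal(0, 5, size=I)           # everything else that affects wages
w = alpha + beta * h + eps               # the DGP equation generates the wages
```

A statistician never gets to see `alpha`, `beta`, or `eps`; only the resulting pairs of `w` and `h`.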

Next, we need a dataset to which we will apply statistical techniques. Only Bayesians can work without data, and even they prefer to have at least some of it to update their priors. For our example we need a set of measurements of people’s wages and heights, with both measurements available for every person in the dataset. Imagine, for example, that you asked everyone in your office building how tall they are and how much they get paid, and miraculously got a bunch of them to answer.

Statisticians would think of the dataset as a *sample* that was drawn
from the *population* of all humans: everyone has a height and most people
have wages (and we can assume zero wages for those who are not employed).
We will sidestep the issues of how representative our sample is for now,
and will only note that a representative sample is generally much more
useful for what we hope to achieve.

Once we have a sample of \((w_i, h_i)\) for \(I\) people, we typically rely on our
DGP assumption to estimate \(\alpha\) and \(\beta\).
Let’s focus on the simple case of using least squares, or OLS, to obtain
*estimates*, usually denoted \(\hat{\alpha}\) and \(\hat{\beta}\).
(Many alternative estimation procedures exist, each resulting in its own
*estimator* and estimates, but that is out of scope for this writeup.)
For our purposes it is important to grok that \(\hat{\alpha}\) and \(\hat{\beta}\)
are *random functions* of the sample values of \(w_i\) and \(h_i\).
A different sample, even of the same size, will almost certainly result in
different values of \(\hat{\alpha}\) and \(\hat{\beta}\), even though we use the
same estimator to obtain them.
Imagine one sample being taken over a group of professional basketball players,
while the other sample uses janitors in the team’s home sports arena.
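This sampling variability is easy to see by simulation. Here is a sketch that draws two different samples from the same hypothetical DGP as above and applies the same OLS estimator (using the closed-form slope \(\hat{\beta} = \mathrm{cov}(w, h)/\mathrm{var}(h)\)) to each; the two estimates come out different even though the estimator and the sample size are identical.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(I):
    # Same hypothetical DGP as before: w = 10 + 0.2 * h + noise.
    h = rng.normal(170, 10, size=I)
    w = 10.0 + 0.2 * h + rng.normal(0, 5, size=I)
    return w, h

def ols_slope(w, h):
    # Closed-form OLS slope: beta_hat = cov(w, h) / var(h).
    return np.cov(w, h)[0, 1] / np.var(h, ddof=1)

b1 = ols_slope(*draw_sample(200))
b2 = ols_slope(*draw_sample(200))
# Same estimator, same sample size, different samples -> different estimates.
print(b1, b2)
```

Both estimates will hover around the hypothetical true value of `0.2`, but they will essentially never coincide exactly: \(\hat{\beta}\) really is a random function of the sample.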

Because we used a well-understood estimator to obtain \(\hat{\alpha}\) and
\(\hat{\beta}\), we know a lot about its statistical properties.
For example, we know that as \(I\) approaches infinity, both \(\hat{\alpha}\) and
\(\hat{\beta}\) have sampling distributions that become indistinguishable from
normal, or, as statisticians like to say, these estimators have asymptotic
normal distributions.
We use these distributions to perform *inference*, which is a fancy word that
means “testing hypotheses about the population using the sample at hand”.

Perhaps the most common inference would be to test a hypothesis that \(\beta = 0\), which would imply no relationship between height and wages. This hypothesis can be tested by computing the t-statistic via the well-known formula:

\[ \frac{\hat{\beta} - \beta_0}{\mathtt{se}(\hat{\beta}) },\]
where \(\mathtt{se}(\hat{\beta})\) is the *standard error*, or our estimate of
the standard deviation of \(\hat{\beta}\)’s sampling distribution.
With \(\beta_0 = 0\) the formula simplifies to the ratio of the estimate to its
standard error.
Recall that \(\hat{\beta}\) has an asymptotic normal distribution.
Under the assumption that the null hypothesis \(\beta = \beta_0\) is true,
we have \(\hat{\beta} \sim_a \mathcal{N}(0,\,\mathtt{se}(\hat{\beta})^2)\), and
hence the t-statistic will have an asymptotic standard normal distribution
\(\mathcal{N}(0,\,1)\).
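The whole pipeline, from sample to t-statistic, fits in a few lines. This sketch uses the textbook OLS formulas for the slope, its standard error, and the t-statistic; the data are again simulated, this time from a hypothetical DGP where \(\beta\) is genuinely zero, so the null hypothesis happens to be true.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample from a DGP where beta is genuinely zero.
I = 300
h = rng.normal(170, 10, size=I)
w = 15.0 + 0.0 * h + rng.normal(0, 5, size=I)

# OLS estimates via the standard closed-form formulas.
h_c = h - h.mean()
beta_hat = (h_c @ w) / (h_c @ h_c)
alpha_hat = w.mean() - beta_hat * h.mean()

# Standard error of beta_hat from the regression residuals.
resid = w - alpha_hat - beta_hat * h
sigma2_hat = (resid @ resid) / (I - 2)          # residual variance estimate
se_beta = np.sqrt(sigma2_hat / (h_c @ h_c))

t_stat = (beta_hat - 0.0) / se_beta             # testing H0: beta = 0
reject = abs(t_stat) > 1.96                     # the usual 5% decision rule
```

With the null true, `reject` will come out `False` in about 95% of simulated samples, which is exactly what the 5% significance level promises.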

Up until this point, we had been simply computing numbers using the sample,
but now comes the actual inference part.
We can make a judgment call of how probable it is to observe our particular
value of t-statistic when we know it is supposed to be a realization of
a standard normal random variable.
Usually, if its absolute value is less than `1.96`, we fail to reject the
hypothesis that \(\beta = 0\) at the 5% significance level, even if our
sample estimate \(\hat{\beta}\) is numerically far from zero.

Because every sample is in practice finite, all the confidence intervals and p-values derived from asymptotic inference are at least a little bit wrong. An asymptotic 95% confidence interval can have 92% or 97% coverage in our particular sample, or it could be even less accurate if our \(I\) is very far from infinity. But the part of the statistical inference process to which many machine learning practitioners tend to object is the DGP assumption that cannot be verified. More specifically, they correctly point out that all of the above machinery hinges on us knowing the functional form of the DGP, and the moment this assumption is violated, all our inferences go out the window. According to them, if we cannot reject the hypothesis that \(\beta = 0\), it may simply be because the “true” relationship between \(w\) and \(h\) is not linear but considerably more complicated. In our example it’s easy to propose improvements to the DGP: for example, taking people’s gender into account will likely result in smaller estimates of \(\hat{\beta}\), simply because men are on average taller and happen to get paid more than women.

Is there a solution to a misspecified DGP?
There is, and it is called the bootstrap, although it requires a computer for
all practical purposes, which is not much of a constraint these days.
The essence of the bootstrap approach is as follows: you admit that you likely
do not know the correct distribution of your estimator, so you attempt to
approximate it with the *empirical distribution*, which is obtained by drawing
a large number, e.g. `1000`, of bootstrap samples of size \(I\) from your
sample, with replacement.
(If you draw without replacement, all bootstrap samples will be identical.)
For each sample, you construct the value of your test statistic, such as the
t-statistic we used above.
Once you have `1000` values of the t-statistic, you can compare the value
computed from the original sample with this distribution and infer how likely
the observed value is under the null hypothesis.
Bootstrap is especially useful when your statistic of interest is a complicated
function of data, and deriving its asymptotic properties is hard.
Tim Hesterberg of Google wrote an excellent paper about the bootstrap that I
highly recommend, though it is more technical than this writeup.