
Sampling Distribution of a Statistic

The term sampling distribution of a statistic refers to the distribution of some sample statistic, over many samples drawn from the same population. Much of classical statistics is concerned with making inferences from (small) samples to (very large) populations.

Typically, a sample is drawn with the goal of measuring something (with a sample statistic) or modeling something (with a statistical or machine learning model). Since our estimate or model is based on a sample, it might be in error—it might be different if we were to draw a different sample. We are therefore interested in how different it might be—a key concern is sampling variability. If we had lots of data, we could draw additional samples and observe the distribution of a sample statistic directly. Typically, we will calculate our estimate or model using as much data as are easily available, so the option of drawing additional samples from the population is not readily available.

Warning

It is important to distinguish between the distribution of the individual data points (the data distribution) and the distribution of a sample statistic (the sampling distribution).

The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data themselves. The larger the sample that the statistic is based on, the more this is true. Also, the larger the sample, the narrower the distribution of the sample statistic.

This is illustrated in the following example using annual income for loan applicants to Lending Club. Take three samples from this data: a sample of 1,000 values, a sample of 1,000 means of 5 values, and a sample of 1,000 means of 20 values. Then plot a histogram of each sample to produce Figure 1-1.

Figure 1-1. Histograms of annual incomes: 1,000 individual data values (top), 1,000 means of 5 values (middle), and 1,000 means of 20 values (bottom)

The histogram of the individual data values is broadly spread out and skewed toward higher values, as is to be expected with income data. The histograms of the means of 5 and 20 values are increasingly compact and more bell-shaped. Here is the R code to generate these histograms, using the visualization package ggplot2.

library(ggplot2)
loans_income <- read.csv("/Users/andrewbruce1/book/loans_income.csv")[,1]
# take a simple random sample
samp_data <- data.frame(income=sample(loans_income, 1000),
                        type='data_dist')
# take a sample of means of 5 values
samp_mean_05 <- data.frame(
  income = tapply(sample(loans_income, 1000*5),
                  rep(1:1000, rep(5, 1000)), FUN=mean),
  type = 'mean_of_5')
# take a sample of means of 20 values
samp_mean_20 <- data.frame(
  income = tapply(sample(loans_income, 1000*20),
                  rep(1:1000, rep(20, 1000)), FUN=mean),
  type = 'mean_of_20')
# bind the data.frames and convert type to a factor
income <- rbind(samp_data, samp_mean_05, samp_mean_20)
income$type = factor(income$type,
                     levels=c('data_dist', 'mean_of_5', 'mean_of_20'),
                     labels=c('Data', 'Mean of 5', 'Mean of 20'))
# plot the histograms
ggplot(income, aes(x=income)) +
  geom_histogram(bins=40) +
  facet_grid(type ~ .)

Central Limit Theorem

This phenomenon is termed the Central Limit Theorem. It says that the means drawn from multiple samples will be shaped like the familiar bell-shaped normal curve, even if the source population is not normally distributed, provided that the sample size is large enough and the departure of the data from normality is not too great. The Central Limit Theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference, i.e., confidence intervals and hypothesis tests.

The Central Limit Theorem receives much attention in traditional statistics texts because it underlies the machinery of hypothesis tests and confidence intervals, which themselves consume half the space in such texts. Data scientists should be aware of this role, but, since formal hypothesis tests and confidence intervals play a small role in data science, and the bootstrap is available in any case, the Central Limit Theorem is not so central in the practice of data science.

Standard Error

The standard error is a single metric that sums up the variability in the sampling distribution for a statistic. The standard error can be estimated using a statistic based on the standard deviation s of the sample values, and the sample size n:

Standard error = SE = s / √n

As the sample size increases, the standard error decreases, corresponding to what was observed in Figure 1-1. The relationship between standard error and sample size is sometimes referred to as the square-root of n rule: in order to reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
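
As a minimal sketch of this formula in R, reusing the loans_income vector loaded earlier (the object names samp20 and samp80 are purely illustrative):

# estimate the standard error of the mean from a single sample using s / sqrt(n)
samp20 <- sample(loans_income, 20)
sd(samp20) / sqrt(20)
# quadrupling the sample size roughly halves the standard error
samp80 <- sample(loans_income, 80)
sd(samp80) / sqrt(80)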

The validity of the standard error formula arises from the central limit theorem. That said, you don't need to rely on the central limit theorem to understand the standard error. Consider the following approach to measuring it:

  1. Collect a number of brand new samples from the population.

  2. For each new sample, calculate the statistic (e.g., mean).

  3. Estimate the standard error by the standard deviation of the statistics computed in step 2.

In practice, the above approach of collecting new samples to estimate the standard error is typically not feasible (and statistically very wasteful). Fortunately, it turns out that it is not necessary to draw brand new samples; instead it is possible to use bootstrap resamples (see “The Bootstrap”). In modern statistics, the bootstrap has become the standard way to estimate standard error. It can be used for virtually any statistic and does not rely on the central limit theorem or other distributional assumptions.
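
Here is a minimal sketch of steps 1-3 in base R, using bootstrap resamples (rather than brand new samples) of a single sample of 20 incomes drawn from loans_income; the object names samp20 and boot_means are illustrative:

# samp20: one observed sample of 20 incomes
samp20 <- sample(loans_income, 20)
# analogue of steps 1-2: the mean of each of 1,000 resamples (drawn with replacement)
boot_means <- replicate(1000, mean(sample(samp20, replace=TRUE)))
# step 3: the standard deviation of those means estimates the standard error
sd(boot_means)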

Standard Deviation vs. Standard Error

Do not confuse standard deviation (which measures the variability of individual data points) with standard error (which measures the variability of a sample metric).

Further Reading

  1. David Lane’s online multimedia resource in statistics has a useful simulation that allows you to select a sample statistic, a sample size, and the number of iterations, and to visualize a histogram of the resulting frequency distribution.

The Bootstrap

One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself, and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data, or the sample statistic, being normally distributed.

Conceptually, you can imagine the bootstrap as replicating the original sample thousands or millions of times so that you have a hypothetical population that embodies all the knowledge from your original sample (it’s just larger). You can then draw samples from this hypothetical population for the purpose of estimating a sampling distribution.

Figure 1-2. The idea of the bootstrap

In practice, it is not necessary to actually replicate the sample a huge number of times. We simply replace each observation after each draw—we sample with replacement. In this way, we effectively create an infinite population in which the probability of an element being drawn remains unchanged from draw to draw. The algorithm for a bootstrap resampling of the mean is as follows, for a sample of size N:

  1. Draw a sample value, record it, and replace it

  2. Repeat N times

  3. Record the mean of the N resampled values

  4. Repeat steps 1-3 B times

  5. Use the B results to:

    1. Calculate their standard deviation (this estimates the standard error of the sample mean)

    2. Produce a histogram or boxplot

    3. Find a confidence interval

B, the number of iterations of the bootstrap, is set somewhat arbitrarily. The more iterations you do, the more accurate the estimate of the standard error, or the confidence interval.
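
One way to carry out this algorithm in R is with the boot package; the following is a sketch rather than the only approach, and it assumes the boot package is installed and treats the full loans_income vector as the original sample (stat_fun and boot_obj are illustrative names):

library(boot)
# the statistic function receives the data and a vector of resample indices
stat_fun <- function(x, idx) mean(x[idx])
# the argument R plays the role of B, the number of bootstrap iterations
boot_obj <- boot(loans_income, statistic=stat_fun, R=1000)
boot_obj$t0     # the mean of the original sample
sd(boot_obj$t)  # bootstrap estimate of the standard error of the mean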

The bootstrap can be used with multivariate data, where the rows are sampled as units (see Figure 1-3). A model might then be run on the bootstrapped data, for example, to estimate the stability (variability) of model parameters, or to improve predictive power. With classification and regression trees (also called decision trees), running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (short for “bootstrap aggregating”).

Figure 1-3. Multivariate bootstrap sampling
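
A minimal sketch of row-wise resampling, where loans_df stands in for a hypothetical data frame of loan records:

# resample whole rows, with replacement, so each record is the sampling unit
idx <- sample(nrow(loans_df), replace=TRUE)
boot_df <- loans_df[idx, ]
# a model could then be refit on boot_df to gauge the variability of its parameters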

The repeated resampling of the bootstrap is conceptually simple, and Julian Simon, an economist and demographer, published a compendium of resampling examples, including the bootstrap, in his 1969 text Basic Research Methods in Social Science. However, it is also computationally intensive, and was not a feasible option before the widespread availability of computing power. The technique gained its name and took off with the publication of several journal articles and a book by Stanford statistician Bradley Efron in the late 1970s and early 1980s. It was particularly popular among researchers who use statistics but are not statisticians, and for use with metrics or models where mathematical approximations are not readily available. The sampling distribution of the mean has been well established since 1908; the sampling distribution of many other metrics has not. The bootstrap can be used for sample size determination—experiment with different values for N to see how the sampling distribution is affected.

The bootstrap met with considerable skepticism when it was first introduced; it had the aura to many of spinning gold from straw. This skepticism stemmed from a misunderstanding of the bootstrap’s purpose.

Warning

The bootstrap does not compensate for a small sample size—it does not create new data, nor does it fill in holes in an existing dataset. It merely informs us about how lots of additional samples would behave, when drawn from a population like our original sample.

Resampling versus Bootstrapping

Sometimes the term resampling is used synonymously with the term bootstrapping, as outlined above. More often, the term resampling also includes permutation procedures, where multiple samples are combined, and the sampling may be done without replacement. In any case, the term bootstrap always implies sampling with replacement from an observed dataset.

Further Reading

  1. An Introduction to the Bootstrap by Efron and Tibshirani (Chapman & Hall, 1993); the first book-length treatment of the bootstrap, and still widely read.

Confidence Intervals

Frequency tables, histograms, boxplots, and standard errors are all ways to understand the potential error in a sample estimate. Confidence intervals are another.

There is a natural human aversion to uncertainty—people (especially experts) say “I don’t know” far too rarely. Analysts and managers, while acknowledging uncertainty, nonetheless place undue faith in an estimate when it is presented as a single number (a point estimate). Presenting an estimate not as a single number but as a range is one way to counteract this tendency. Confidence intervals do this in a manner grounded in statistical sampling principles.

Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%. One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed).

Given a sample of size n, and a sample statistic of interest, the algorithm for a bootstrap confidence interval is as follows:

  1. Draw a random sample of size n with replacement from the data (a resample)

  2. Record the statistic of interest for the resample

  3. Repeat steps 1-2 many times; call the number of repetitions B

  4. For an x% confidence interval, trim [(100 – x)/2]% of the B resample results from either end of the distribution

  5. The trim points are the endpoints of an x% bootstrap confidence interval
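
Here is a minimal sketch of this algorithm in base R, for a 90% interval around the mean of a hypothetical sample of 20 incomes drawn from loans_income (samp20 and boot_means are illustrative names):

samp20 <- sample(loans_income, 20)
# steps 1-3: the statistic of interest (the mean) for B = 1,000 resamples
boot_means <- replicate(1000, mean(sample(samp20, replace=TRUE)))
# steps 4-5: trim (100 - 90)/2 = 5% from each end of the distribution
quantile(boot_means, probs=c(0.05, 0.95))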

Figure 1-4 shows a 90% confidence interval for the mean annual income of loan applicants, based on a sample of 20 for which the mean was $57,573.

Figure 1-4. Bootstrap confidence interval for the annual income of loan applicants, based on a sample of 20

The bootstrap is a general tool that can be used to generate confidence intervals for most statistics, or model parameters. Statistical textbooks and software, with roots in over a half-century of computerless statistical analysis, will also reference confidence intervals generated by formulas, especially the t-distribution.
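
For comparison, a formula-based interval is a one-liner in R: t.test reports a t-distribution confidence interval for the mean (samp20 here is the hypothetical sample of 20 incomes used in the sketch above):

# 90% t-interval for the mean of the sample
t.test(samp20, conf.level=0.90)$conf.int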

Note

Of course, what we are really interested in when we have a sample result is “what is the probability that the true value lies within a certain interval?” This is not really the question that a confidence interval answers, but it ends up being how most people interpret the answer.

The probability question associated with a confidence interval starts out with the phrase “Given a sampling procedure and a population, what is the probability that…” To go in the opposite direction, “Given a sample result, what is the probability that (something is true about the population),” involves more complex calculations and deeper imponderables.

The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval. Also, the smaller the sample, the wider the interval (i.e., the more uncertainty). Both make sense: the more confident you want to be, and the less data you have, the wider you must make the confidence interval to be sufficiently assured of capturing the true value.

Note

For a data scientist, a confidence interval is a tool to get an idea of how variable a sample result might be. Data scientists would use this information not to publish a scholarly paper or submit a result to a regulatory agency (as a researcher might), but most likely to communicate the potential error in an estimate, and, perhaps, learn whether a larger sample is needed.

Further Reading

  1. For a bootstrap approach to confidence intervals see Introductory Statistics and Analytics: A Resampling Perspective by Peter Bruce (Wiley, 2014) or Statistics by Robin Lock and four other Lock family members (Wiley, 2012).

  2. Engineers, with a need to understand the precision of their measurements, use confidence intervals perhaps more than most disciplines, and Modern Engineering Statistics by Tom Ryan (Wiley, 2007) discusses confidence intervals. It also reviews a tool that is just as useful and gets less attention: prediction intervals (intervals around a single value, as opposed to a mean or other summary statistic).
