Many engineers haven't had direct exposure to statistics or Data Science. Yet, when building data pipelines or translating Data Scientist prototypes into robust, maintainable code, engineering complexities often arise. For Data/ML engineers and new Data Scientists, I've put together this series of posts.
I'll explain core Data Science approaches in simple terms, building from basic concepts to more complex ones.
The whole series:
- Data Science. Probability
- Data Science. Bayes theorem
- Data Science. Probability Distributions
- Data Science. Measures
- Data Science. Correlation
- Data Science. The Central Limit Theorem and Sampling
- Data Science. Demystifying Hypothesis Testing
- Data Science. Data types
- Data Science. Descriptive and Inferential Statistics
- Data Science. Exploratory Data Analysis
In this post, we're going to explore two key concepts that help us make sense of random data: the Central Limit Theorem (CLT) and sampling.
The Central Limit Theorem (CLT)
The practice of studying random phenomena reveals that the results of individual observations, even those made under identical conditions, may differ. However, the average result of a sufficiently large number of observations is remarkably stable, fluctuating only slightly from one series of observations to the next. This stability of averages is formalized by the law of large numbers; the Central Limit Theorem, which should not be confused with it, goes further and describes the shape of the distribution that those averages follow.
According to the Central Limit Theorem, the mean of a sufficiently large sample will approximate the population mean, and the distribution of such sample means will be nearly normal. This is what makes the theorem so useful: it holds no matter what the underlying population distribution looks like.
To illustrate, consider the following animation of a die roll, along with sample code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# Number of trials and the (exclusive) upper bound of the simulated rolls
n, m = 200, 31

# Each trial draws one more roll than the previous one
# and stores the average of that draw
avg = []
for i in range(2, n):
    rolls = np.random.randint(0, m, i)
    avg.append(np.average(rolls))

# Function that redraws the histogram of averages on each animation frame
def clt(current):
    plt.cla()
    if current == n:
        anim.event_source.stop()
    plt.xlim(0, m)
    plt.hist(avg[0:current])
    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')
    plt.annotate(f'Die roll = {current}', [3, 27])

fig = plt.figure()
anim = animation.FuncAnimation(fig, clt, interval=1, save_count=n)
anim.save('animation.gif', writer='imagemagick', fps=10)
In practice, scientists often work with sample statistics to understand population characteristics, since collecting data on entire populations is rarely feasible. A large sample size enables scientists to leverage Central Limit Theorem principles, effectively applying properties of a normal distribution to their sample analysis.
When discussing the stability achieved with larger sample sizes, it's important to recognize that "large" is relative. The required sample size for the sample mean to approximate a normal distribution varies based on the population's characteristics, particularly skewness and variance.
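To make this concrete, here is a minimal sketch (not part of the original animation) that draws repeated samples of different sizes from a deliberately skewed exponential population; all of the numbers and sample sizes are arbitrary and chosen only for illustration:

import numpy as np

rng = np.random.default_rng(42)

# A strongly right-skewed "population": exponential with mean 1
population = rng.exponential(scale=1.0, size=1_000_000)

# For each sample size, draw many samples and inspect the distribution
# of their means
for sample_size in (5, 30, 500):
    means = np.array([rng.choice(population, size=sample_size).mean()
                      for _ in range(2_000)])
    # As the sample size grows, the mean of the sample means stays near the
    # population mean (1.0), their spread shrinks roughly like 1/sqrt(n),
    # and their skewness moves toward 0, i.e. toward a normal shape
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n={sample_size:4d}  mean of means={means.mean():.3f}  "
          f"std of means={means.std():.3f}  skewness={skew:.2f}")

Running this shows the small-sample means still inheriting the population's skew, while the large-sample means cluster tightly and symmetrically around the population mean.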
Misinterpretations of CLT
The Central Limit Theorem is often simplified to suggest that all sample means are normally distributed if the sample is "large enough". But this can lead to misunderstandings. Here are some common misconceptions:
- Any sample size will yield a normal distribution: Not quite. The CLT only guarantees an approximately normal distribution of sample means as the sample size becomes sufficiently large. Means of small samples may not look normal, especially if the population is skewed.
- CLT applies to any kind of data: While the theorem lets us approximate population means, it assumes that data points are drawn independently and form a simple random sample. Highly skewed populations or non-independent data (such as time-series data with autocorrelation) may not follow the CLT as expected.
- Single-sample misinterpretation: The CLT applies to the distribution of sample means across repeated sampling, not to the raw data in a one-time, isolated sample. This distinction is crucial to avoid overgeneralizing results from a single sample (see the sketch after this list).
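To illustrate the last point, here is a small sketch, assuming NumPy and SciPy are available, that contrasts the raw data in a single large sample with the means of many repeated samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One large sample from a skewed (exponential) population: the raw data
# points inside it are still skewed, not normal
one_sample = rng.exponential(scale=1.0, size=5_000)
print("skewness of raw data in a single sample:",
      round(stats.skew(one_sample), 2))       # roughly 2, far from 0

# The CLT statement is about means of repeated samples:
# those means are approximately normal
sample_means = rng.exponential(scale=1.0, size=(5_000, 50)).mean(axis=1)
print("skewness of 5000 sample means:",
      round(stats.skew(sample_means), 2))     # close to 0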
Sampling
As we discussed in the earlier posts on probability theory, probability distributions, and distribution characteristics, we can estimate population traits from a sample. However, even the best theory and the most accurate distribution can't save us if the sample isn't well designed or representative of the entire population.
But why sample at all? Taking a sample means discarding some of the data anyway, so why not just collect all of it and work with the whole population?
First, collecting the data is expensive. Think about polling. Even if we limit the scope to the population of California, around 39 million people, interviewing everyone would require a huge budget. And even with that budget, reaching every individual is nearly impossible.
The general idea of sampling is simple: take a small, heterogeneous sample that reflects the general population in terms of key criteria. For instance, instead of surveying every Californian, we take a sample representing California's population on the characteristics that matter most to us.
The idea of sampling is like tasting soup. A good soup has many ingredients, chopped differently and added at different times, but you don't need to eat the entire pot to know if it's good (or not). If we had to finish the whole soup just to assess its quality, cooking would be absurdly inefficient. Instead, we take a spoonful — based on this little portion, we try to assess if the soup turned out well or if it needs adjustment. If we scoop from the top, we'll get mostly broth; if we scoop from the bottom, we might only get big chunks. To sample the soup properly, we mix it, ensuring a spoonful captures the full range of ingredients — broth, vegetables, meat. Then, with one taste, we get a reasonable assessment. Sampling works similarly.
The Available (Convenience) Sample
Let's say we're studying the preferences of young people but limit our sample to students from the university we attend or work at. If we then claim to understand the preferences of all young people based on this sample, we're missing a significant portion of the population — those who aren't in our university or even in any university. The available sample gives us part of the picture, but it's an incomplete one.
Another problem is that this sampling method only reaches people willing to talk to us. Unrepresentative samples like this fail to account for diverse viewpoints since not everyone has an equal chance of being included. Random sampling formally provides this equal opportunity, at least giving different perspectives a chance to be part of our findings.
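A toy simulation with made-up numbers can show how much a convenience sample can mislead compared with a random one:

import numpy as np

rng = np.random.default_rng(7)

# A made-up population of "young people" aged 16-30, where the preference
# score we want to estimate happens to grow with age
ages = rng.integers(16, 31, size=100_000)
preference = 50 + 2 * (ages - 16) + rng.normal(0, 10, size=ages.size)

# "Available" sample: only the people we can easily reach,
# e.g. university students aged 18-23
available = rng.choice(preference[(ages >= 18) & (ages <= 23)], size=500)

# Simple random sample: every member of the population has an equal
# chance of being selected
random_sample = rng.choice(preference, size=500)

print("population mean:      ", round(preference.mean(), 1))
print("available-sample mean:", round(available.mean(), 1))      # biased low
print("random-sample mean:   ", round(random_sample.mean(), 1))  # close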
Probability Sampling Methods
The simplest and most straightforward approach is simple random sampling, where we have a complete list of the population and select members randomly. For example, if we want a sample from all phone owners in New York City and have access to a list of all numbers, we can use a "random number selector" to pick as many numbers as we need to call. Voilà — simple random selection.
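As a minimal sketch, assuming we somehow obtained the complete list of numbers as a Python list, simple random sampling is just a uniform draw without replacement:

import random

# Assume we have the complete list of phone numbers (placeholder strings here)
all_numbers = [f"+1-212-555-{i:04d}" for i in range(10_000)]

random.seed(42)
# Simple random sample: every number has the same chance of being picked
sample = random.sample(all_numbers, k=100)   # 100 distinct numbers
print(sample[:5])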
Another approach is stratified sampling, which is useful when we know that our population contains several homogeneous subgroups that need representation. For instance, if men and women have different views on a certain topic, we can divide the population into male and female groups and then randomly sample from both. This way, each group's perspective is accounted for in our final sample.
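A possible sketch of stratified sampling on a made-up survey frame, using pandas (the column names and sampling fraction are arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# A made-up survey frame with a known subgroup ("stratum") column
population = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=10_000, p=[0.48, 0.52]),
    "opinion": rng.normal(0, 1, size=10_000),
})

# Stratified sample: draw the same fraction from every stratum, so each
# group is represented in proportion to its size
# (DataFrameGroupBy.sample is available in pandas >= 1.1)
sample = (population
          .groupby("gender", group_keys=False)
          .sample(frac=0.05, random_state=0))

print(sample["gender"].value_counts(normalize=True))  # ~ population shares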
Finally, there's cluster sampling, often used in city research where populations are naturally divided into districts. For example, a city may have clusters of neighborhoods with similar socioeconomic characteristics. Instead of sampling from every neighborhood, we randomly choose a subset of clusters, like three out of twelve districts. This method is practical for large or geographically dispersed populations.
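A rough sketch of cluster sampling on made-up data, where three of twelve hypothetical districts are selected at random and only their residents are surveyed:

import numpy as np

rng = np.random.default_rng(3)

# A made-up city with 12 districts; each person belongs to one district
district_of_person = rng.integers(0, 12, size=50_000)
income = rng.normal(50_000, 15_000, size=50_000) + 1_000 * district_of_person

# Cluster sampling: randomly pick 3 whole districts and survey only
# the people who live in them
chosen = rng.choice(np.arange(12), size=3, replace=False)
cluster_sample = income[np.isin(district_of_person, chosen)]

print("chosen districts:", sorted(chosen))
print("cluster-sample mean income:", round(cluster_sample.mean()))
print("population mean income:    ", round(income.mean()))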
Non-Probability Sampling Methods
Sometimes, non-probability sampling methods are necessary. For example, snowball sampling is ideal for studying hard-to-reach groups or when the population size isn't well-defined. Here's how it works: we start with one person, who connects us to another, who then connects us to another, building a "snowball" of contacts. We can even start several snowballs to capture a broader and more varied sample.
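A small sketch of how such a snowball might be grown in code, using a made-up referral network and an illustrative snowball helper (both are hypothetical, not a standard API):

import random

# A made-up referral network: who can put us in touch with whom
contacts = {
    "ann": ["bob", "eve"],
    "bob": ["carl"],
    "eve": ["dina", "frank"],
    "carl": [], "dina": ["gus"], "frank": [], "gus": [],
}

def snowball(seeds, waves=2, per_person=2, seed=0):
    """Grow a sample by following referrals for a fixed number of waves."""
    rng = random.Random(seed)
    sampled, frontier = set(seeds), list(seeds)
    for _ in range(waves):
        next_frontier = []
        for person in frontier:
            new = [c for c in contacts.get(person, []) if c not in sampled]
            for referral in rng.sample(new, min(per_person, len(new))):
                sampled.add(referral)
                next_frontier.append(referral)
        frontier = next_frontier
    return sampled

print(snowball(seeds=["ann"]))  # the snowball after two waves of referrals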
Although this method is statistically unrepresentative, it's indispensable for certain studies, such as those involving hidden or difficult-to-access populations. Without it, we couldn't access some groups at all.
Conclusion
The Central Limit Theorem is one of the most important concepts in statistics, and therefore in data science. It is what makes it possible to test statistical hypotheses, that is, to check whether assumptions about the general population are supported by a sample.
Sampling is a cheap and intuitive way to obtain a small but representative picture of a population. Probability sampling methods are preferable for most research tasks, but some tasks can only be tackled with non-probability sampling. Such samples are indispensable in those situations, yet in the statistical sense they are not representative. All of the theory about distributions and about inferring population characteristics from a sample therefore applies only to random samples.
Additional materials
- Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, Peter Gedeck
- Naked Statistics by Charles Wheelan