Data Science. The Central Limit Theorem and sampling

Many engineers have never worked in statistics or data science. But when they build data science pipelines, or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. This series of posts is for those Data/ML engineers and for novice data scientists. I'll try to explain some basic approaches in plain English and, based on them, explain some of the basic concepts of Data Science.



The practice of studying random phenomena shows that the results of individual observations, even those made under the same conditions, may differ. But the average of a sufficiently large number of observations is stable and fluctuates only slightly depending on the results of individual observations. The theoretical basis for this stability is the law of large numbers. The Central Limit Theorem goes one step further and describes how those averages are distributed around the population mean.

According to the Central Limit Theorem, as the sample size increases, the distribution of the sample mean becomes approximately normal and concentrates more tightly around the mean of the whole population. The significance of this theorem follows from the fact that this is true regardless of the shape of the population distribution, as long as its variance is finite.
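A minimal sketch of this claim, assuming NumPy is available. The exponential population, the seed, and the helper name `sample_means` are illustrative choices, not part of the original post. The exponential distribution is strongly skewed, yet the means of repeated samples should center on the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, only for reproducibility

# Assumed population: exponential with scale 1 (mean 1, standard deviation 1)
population_mean = 1.0
population_std = 1.0


def sample_means(sample_size, n_samples=10_000):
    """Draw n_samples samples of the given size and return their means."""
    draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    return draws.mean(axis=1)


for sample_size in (2, 30, 500):
    means = sample_means(sample_size)
    # Spread the theorem predicts for the sample mean
    predicted_std = population_std / np.sqrt(sample_size)
    print(
        f"size={sample_size:4d}  mean of means={means.mean():.3f}  "
        f"std of means={means.std():.3f}  predicted std={predicted_std:.3f}"
    )
```

Plotting a histogram of `means` for each sample size should show the skew fading and a bell shape emerging as the sample size grows.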

To illustrate the concept, check the following animation of the die roll and the code that produces it:

Die roll

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# n bounds the sample sizes; each "roll" is a uniform integer in 0..m-1
n, m = 200, 31

# In each simulation, there is one trial more than the previous simulation
avg = []
for i in range(2, n):
    rolls = np.random.randint(0, m, i)
    avg.append(np.average(rolls))


# Function that plots the histogram of the averages up to the current frame
def clt(current):
    plt.cla()

    plt.xlim(0, m)
    plt.hist(avg[:current + 1])

    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')

    plt.annotate('Die roll = {}'.format(current + 1), [3, 27])

fig = plt.figure()
# frames=len(avg) makes the animation end after the last average is drawn
anim = animation.FuncAnimation(fig, clt, frames=len(avg), interval=1)
# The 'imagemagick' writer requires ImageMagick to be installed
anim.save('animation.gif', writer='imagemagick', fps=10)

In practice, to understand population characteristics, scientists usually sample data and work with sample statistics. They work with samples to understand and summarize population patterns. With a large enough sample, the Central Limit Theorem allows the properties of the normal distribution to be applied in this process.
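One common way to apply normal distribution properties to a sample is an approximate 95% interval around the sample mean. The sketch below assumes NumPy and a made-up skewed population. The population, the sample size of 400, and the variable names are illustrative assumptions. The constant 1.96 comes from the normal distribution, where about 95% of values lie within 1.96 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed, only for reproducibility

# Hypothetical population: 100,000 skewed (non-normal) values
population = rng.exponential(scale=50.0, size=100_000)

# Draw one random sample without replacement
sample = rng.choice(population, size=400, replace=False)

sample_mean = sample.mean()
# Standard error of the mean: sample standard deviation / sqrt(sample size)
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

# Normal-approximation interval, justified by the Central Limit Theorem
z = 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"sample mean:     {sample_mean:.2f}")
print(f"95% interval:    ({ci_low:.2f}, {ci_high:.2f})")
print(f"population mean: {population.mean():.2f}")
```

In a real study the population mean is unknown. It is printed here only because the population is simulated, which makes it possible to see how close the interval lands.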

We already know that the normal distribution is special. We can also use some of its properties for distributions that, strictly speaking, cannot be called normal.

Sampling

So far, we have talked about how knowing the theoretical distribution, probability theory, and the distribution's characteristics lets us estimate from a sample what happens in the whole population. But even the best theory and the best-known distribution will not help us if the sample we are estimating from is designed incorrectly or does not represent the general population.

Sampling concept

If we take a sample, we'll discard some data anyway. Why not just grab all the data and work with the whole population?

First, data has to be collected, and that is very expensive. Think about polls. Even if we don't want to represent the entire US population but only, for example, the population of California, that's 39 million people. To interview 39 million people we need a budget, which of course we don't have. Besides, even if we had such a budget, it is almost impossible to reach all the residents of any state.

The idea of sampling is, in general, simple: take a small sample that is nevertheless heterogeneous enough on the key criteria to represent our general population. In other words, we don't survey all the residents of California, but take a piece that represents California on the criteria that are important to us.

The idea of sampling is very similar to tasting soup. When we make a soup that contains many ingredients, they are cut differently and added at different times, and in the end we must assess the quality of the soup. We don't need to eat all the soup to assess how delicious it is (or isn't). Moreover, if we had to eat all the soup to assess it, any idea of cooking for a group would be somewhat absurd. So what do we do? We make the soup, then take a spoon, scoop some up and, based on this little portion, try to assess whether we have made a good soup or need to change something about it.

If we just take some particular part, say we scoop from the top, we'll have a spoon full of broth, and it won't give us any idea about the ingredients (vegetables or meat). If we scoop from the bottom, we may only get big pieces and learn nothing about the small ones. To get all the ingredients of our soup into the sample we taste, we have to mix it well first and then scoop. Then all the ingredients end up in the spoon: big pieces, small pieces, and broth. From this portion we can already judge how well all the ingredients in the soup are prepared. The same applies to sampling.

The analog of this mixing in the case of sampling is random sampling. It is a random selection of data whose essence is to give every element of the population the same probability of ending up in the sample, which is what lets us expect a representative sample.

What's so terrible about the sample not being representative?

Well, I'll focus on a few issues. The easiest and most understandable example is an available sample, also known as a convenience sample.

Available sample

For example, we study the preferences of young people, but since we study or work at a university, we interview only the students of our university. And we say that based on this research we learn about all young people. Obviously, we will not learn about all young people, because the sample is a very specific group. The available sample gives us some part of the picture, but not the complete picture. In this case, a significant proportion of young people, those who do not study at a university or who study at other universities, will not be covered by our sample. Another problem is that we can only choose those people who want to talk to us. That is, the problem with such unrepresentative samples is that we do not give different people the same chance of being included, so different views cannot be represented by our sample. Random sampling at least formally guarantees the possibility of such representation.
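A small simulation can make the bias of an available (convenience) sample concrete. Everything in this sketch is made up for illustration: the share of students, the quantity being measured (weekly reading hours), and its distribution in each group. It assumes NumPy and compares a students-only sample with a random sample of the same size.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # fixed seed, only for reproducibility

# Hypothetical population of 50,000 young people, about 30% of them students
n_people = 50_000
is_student = rng.random(n_people) < 0.3

# Made-up quantity of interest: weekly reading hours,
# assumed to be higher on average for students
hours = np.where(
    is_student,
    rng.normal(10.0, 2.0, n_people),
    rng.normal(4.0, 2.0, n_people),
)

sample_size = 500

# Available sample: we only interview students
student_idx = np.flatnonzero(is_student)
available_idx = rng.choice(student_idx, size=sample_size, replace=False)

# Random sample: every person has the same chance of being selected
random_idx = rng.choice(n_people, size=sample_size, replace=False)

print(f"population mean:       {hours.mean():.2f}")
print(f"available sample mean: {hours[available_idx].mean():.2f}")
print(f"random sample mean:    {hours[random_idx].mean():.2f}")
```

Under these assumptions the students-only estimate lands far from the population mean no matter how large the sample is, because the error comes from who is included, not from how many.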

Probability sampling methods

Simple random sampling

The easiest and most straightforward method is simple random sampling, where we have a complete list of the elements of the general population and select elements from it at random. For example, our population is all owners of phone numbers in New York City, and we have a full list of those numbers. We turn on a random number generator, select as many numbers as we need, and call those phone numbers. That is a simple random selection.
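As a rough sketch of the phone number example, assuming NumPy: the list of numbers below is fabricated, and in practice it would be the complete sampling frame. `Generator.choice` with `replace=False` gives every number the same chance of selection and never picks the same number twice.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # fixed seed, only for reproducibility

# Hypothetical sampling frame: a complete list of phone numbers
phone_numbers = [f"555-{i:04d}" for i in range(10_000)]

# Simple random sample of 100 numbers, without replacement
sample = rng.choice(phone_numbers, size=100, replace=False)

print(sample[:5])
```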

Stratified sampling

Another option is stratified sampling. Here we know something about our population: we know that it consists of several internally homogeneous groups (strata) that should be represented in our sample. For example, men and women may have different views on certain issues. We first divide our population into the strata of men and women, and then randomly select within each stratum to ensure that both are represented in the final sample.
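A minimal sketch of stratified sampling, assuming NumPy. The population sizes and the helper `stratified_sample` are hypothetical. This version uses proportional allocation, where each group (stratum) contributes to the sample in proportion to its share of the population. Other allocation rules exist and may fit a given study better.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # fixed seed, only for reproducibility

# Hypothetical population: one group label per person
groups = np.array(["men"] * 4_000 + ["women"] * 6_000)


def stratified_sample(labels, total_size, rng):
    """Return indices sampled from each stratum in proportion to its size."""
    chosen = []
    for label in np.unique(labels):
        stratum = np.flatnonzero(labels == label)
        # Proportional allocation: 40% of the population gets 40% of the sample
        k = round(total_size * stratum.size / labels.size)
        chosen.append(rng.choice(stratum, size=k, replace=False))
    return np.concatenate(chosen)


sample_idx = stratified_sample(groups, total_size=200, rng=rng)

names, counts = np.unique(groups[sample_idx], return_counts=True)
counts_by_group = {str(name): int(count) for name, count in zip(names, counts)}
print(counts_by_group)  # → {'men': 80, 'women': 120}
```

The group counts in the sample are fixed by design. Only the choice of individuals inside each group is random.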

Cluster sampling

Another option is cluster sampling. Such samples are used, for example, in research on a city, which is very often divided into districts. Some of the districts are similar to each other, some are different, and districts that are similar in, say, social and economic conditions can stand in for one another. We first divide the city into clusters, and then randomly choose some of these clusters, so that instead of going into all twelve districts, for example, we randomly choose three of the twelve and survey people only there.
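The district example can be sketched in two stages, assuming NumPy. The city below is simulated, with residents assigned to twelve districts at random. The numbers are illustrative only. This sketch surveys every resident of the chosen districts. Drawing a further random sample inside each chosen district is another common variant.

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # fixed seed, only for reproducibility

# Hypothetical city: 20,000 residents, each tagged with one of 12 district ids
n_districts = 12
district_of_resident = rng.integers(0, n_districts, size=20_000)

# Stage 1: randomly choose 3 of the 12 districts (the clusters)
chosen_districts = rng.choice(n_districts, size=3, replace=False)

# Stage 2: keep only the residents who live in the chosen districts
in_chosen = np.isin(district_of_resident, chosen_districts)
sample_idx = np.flatnonzero(in_chosen)

print("chosen districts:", sorted(chosen_districts.tolist()))
print("residents in sample:", sample_idx.size)
```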

Non-probability sampling methods

Snowball sampling

Non-probability sampling methods are also needed. Moreover, non-probability sampling is irreplaceable in some cases. For example, there is a sampling method called snowball sampling. It is necessary if we study hard-to-reach groups or if we don't know the exact size of the general population. We talk to the first person, who connects us to the next person, who connects us to the next one, and so we accumulate a snowball. We grow the sample starting from one element of the population, or sometimes we start several snowballs to make the sample more heterogeneous. Of course, such a sample is not statistically representative, but there are tasks that we simply can't solve without this method.
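The referral process can be sketched as a breadth-first walk over a network of contacts. The network, the names, and the function `snowball_sample` below are all hypothetical. A real study collects referrals interview by interview instead of reading them from a ready-made dictionary.

```python
from collections import deque

# Hypothetical referral network: who can introduce us to whom
referrals = {
    "Ann": ["Bob", "Cid"],
    "Bob": ["Dee"],
    "Cid": ["Dee", "Eve"],
    "Dee": ["Fay"],
    "Eve": [],
    "Fay": ["Ann"],
    "Gus": ["Hal"],  # this pair cannot be reached starting from Ann
    "Hal": [],
}


def snowball_sample(seeds, referrals, max_size):
    """Collect respondents by following referrals from the seeds, breadth-first."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sample


print(snowball_sample(["Ann"], referrals, max_size=5))
# → ['Ann', 'Bob', 'Cid', 'Dee', 'Eve']
```

Starting only from Ann never reaches Gus or Hal, which mirrors why such a sample is not representative. Passing several seeds, for example `["Ann", "Gus"]`, corresponds to starting several snowballs.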

Conclusion

The Central Limit Theorem is quite an important concept in statistics, and therefore in data science. It is what allows us to test so-called [statistical hypotheses](https://luminousmen.com/post/demystifying-hypothesis-testing), i.e. to test whether assumptions hold for the general population.

Sampling is a cheap and understandable way to obtain a small but representative set of data about a population. Probability sampling methods are preferable for most research tasks, but there are tasks for which non-probability sampling is indispensable, even though in a statistical sense non-probability samples are not representative. Therefore, all this theoretical knowledge about distributions and about drawing conclusions about a general population from a sample can be applied only to random samples.
