Many engineers have never worked in statistics or data science. Yet when they build data science pipelines, or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. This series of posts is for those Data/ML engineers and for novice data scientists. I'll try to explain some basic approaches in plain English and, building on them, some of the basic concepts of Data Science.


In statistics, the data we see is interesting not for its own sake, but for what it tells us about where it comes from (the population). To evaluate and describe the characteristics of a population, we need to know a couple of things. First, which values of these characteristics are typical for the distribution under study. Second, how much the values differ from one another (how scattered they are), and therefore how typical they really are for that distribution.

The first task is solved by measures of central tendency. The second task is solved by measures of variation.

Measures of Central Tendency

To study the various measures of central tendency, we will create some data in the form of a story.

Imagine 5 people sitting in a bar, each with an income of $40,000. The average income of the people in the bar is $40,000. In the data below, incomes are written in thousands of dollars.

data = [40, 40, 40, 40, 40]

With this data in hand, let's turn to the theory.

Mean

Probability theory studies random variables and describes their behavior through various characteristics. One of the main characteristics of a random variable is the mean (also called the mathematical expectation or expected value), which is a kind of average of the values the variable can take.

The mean gives us a generalized estimate and can help us make decisions. For example, the average income in a company gives us an approximate estimate of how many new employees we can hire, and the average check in a restaurant helps us decide whether to go there or not.

The mean is a simple measure and has the following formula:

$$ mean = \sum\limits_{i = 1}^n {x_i p_i} $$

where $# x_i #$ – the possible values of the random variable, $# p_i #$ – their probabilities.

As can be seen from the formula, the average value of a random variable is a weighted sum of values where weights are equal to the corresponding probabilities.

For example, if you calculate the mean of the sum of points when throwing two dice, you get 7. But here we know exactly all the possible values and their probabilities. What if there is no such information, only the results of some observations? The answer comes from statistics, which lets us estimate an approximate value of the mean from the available data.
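The dice example can be checked directly with the weighted-sum formula. This is a minimal sketch that assumes two fair six-sided dice; the names `outcomes` and `expected_sum` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))  # each outcome has probability 1/36

# Weighted sum: the value of each outcome times its probability.
expected_sum = sum((a + b) * p for a, b in outcomes)
print(expected_sum)  # 7
```

Exact fractions are used here instead of floats so that the result is exactly 7 rather than a rounded approximation.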

Mathematical statistics provides several options for estimating the mean. The main one is the arithmetic mean, which has some useful properties. For example, the arithmetic mean is an unbiased estimator, i.e. its expected value equals the true mean of the population.

Thus, the arithmetic mean is calculated by a formula that is known to any student.

$$ mean = \frac{1}{n} \sum\limits_{i = 1}^n {x_i} $$

where $#x_i#$ – observed values, n – number of values.

def mean(x):
    return sum(x) / len(x)

mean(data)  # 40

The disadvantage of this measure is its sensitivity to outliers in the sample: values that deviate drastically from the center of the distribution can distort it significantly.

If another person with an income of $40k comes into our imaginary bar, the average income of the people in the bar stays the same.

data1 = data + [40]
mean(data1)  # 40

If Jeff Bezos walks into the bar with an income of $10 million, the average income in the bar goes up to $1.7 million (1700 in our units of thousands), although the income of the first 5 people has not changed.

data2 = data + [10000]
mean(data2)  # 1700

To deal with this problem, there are other measures of the central tendency: truncated mean, mode, and median.
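The truncated mean is not shown elsewhere in this post, so here is a minimal sketch of one common variant: drop the `k` smallest and `k` largest values, then average the rest. The function name `truncated_mean` and the parameter `k` are assumptions made for illustration.

```python
def truncated_mean(x, k=1):
    """Mean after dropping the k smallest and k largest values.

    k is a hypothetical parameter chosen for this illustration.
    """
    if len(x) <= 2 * k:
        raise ValueError("not enough values left after trimming")
    trimmed = sorted(x)[k:len(x) - k]
    return sum(trimmed) / len(trimmed)

# The bar after the billionaire walks in, incomes in thousands of dollars.
bar = [40, 40, 40, 40, 40, 10000]
print(sum(bar) / len(bar))   # 1700.0
print(truncated_mean(bar))   # 40.0
```

Trimming one value from each end is enough to remove the single outlier here; with more outliers, a larger `k` would be needed.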

Mode

Another simple measure. Mode is simply the most frequent value in a dataset.

There may be several modes. And the presence of multiple modes is also a certain characteristic of the data to be observed. This is an indication that the data has some internal structure, that there may be some subgroups that are qualitatively different from each other. And perhaps it makes sense not to look at this distribution as a whole but to divide it into subgroups and look at them separately.

from collections import Counter

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    # .items() is the Python 3 replacement for .iteritems()
    return [x_i for x_i, count in counts.items() if count == max_count]

mode(data)  # [40]

Mode is indispensable for qualitative variables and is of little use for quantitative ones. It also helps us to estimate the most typical value of the data sample.
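To illustrate the mode on a qualitative variable, here is a small sketch with made-up drink orders. The list `drinks` and the function name `modes` are hypothetical; the logic is the same counting approach as above.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

# Hypothetical categorical data: what people in the bar ordered.
drinks = ["beer", "wine", "beer", "cider", "wine"]
print(modes(drinks))  # ['beer', 'wine']
```

A mean or median of drink names would be meaningless, while the two modes hint at two distinct groups of customers.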

Median

The central tendency can be considered not only as a value with zero total deviation (arithmetic mean) or maximum frequency (mode), but also as a certain mark (a certain level of the analyzed characteristic) that divides the ordered data into two equal parts. That is, half of the values are smaller than this mark, and half are bigger. This is the median.

Mode and median are important measures: they reflect the structure of the data and are sometimes used instead of the arithmetic mean.

def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
    
median(data)  # 40
median(data2)  # 40

Obviously, in a symmetrical distribution with a single peak, the middle that separates the sample in half will be in the very center, in the same place as the mean and the mode. It is, so to speak, an ideal situation when mode, median, and mean coincide. However, life is not as symmetrical as the normal distribution.
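A small made-up skewed sample shows how the three measures drift apart when the symmetry is lost. The `incomes` list is hypothetical; `statistics.median` and `statistics.mode` come from the Python standard library.

```python
import statistics

# Hypothetical right-skewed incomes, in thousands of dollars.
incomes = [20, 25, 25, 30, 40, 60, 150]

mean_income = sum(incomes) / len(incomes)
median_income = statistics.median(incomes)
mode_income = statistics.mode(incomes)

print(mean_income)    # 50.0
print(median_income)  # 30
print(mode_income)    # 25
```

The long right tail pulls the mean above the median, and the median above the mode.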

Measures of Variability (Dispersion)

To understand how the characteristics of a data sample behave, it is not enough to know the mean or the typical values. We also need to know their variability: not only what is typical, but also how far the values lie from the mean and how different the atypical values are. For that, we have the measures of variation.

Let's get back to our imaginary situation. Imagine that we now have two bars:

data1 = [40, 40, 40, 40, 40]
data2 = [80, 40, 15, 25, 40]

mean(data1)  # 40
mean(data2)  # 40
median(data1)  # 40
median(data2)  # 40
mode(data1)  # [40]
mode(data2)  # [40]

They seem to be similar in characteristics we looked at, but the data are actually different.

Range

The simplest and most straightforward measurement of variability is the range. The range is the distance between the minimum and maximum characteristic value.

def data_range(x):
    return max(x) - min(x)

data_range(data1)  # 0
data_range(data2)  # 65

On the one hand, the range can be very informative and useful. For example, the difference between the maximum and minimum price of an apartment in a city, the difference between the maximum and minimum wages in a region, and so on. On the other hand, because it depends only on the two most extreme values, the range can be very large and make no practical sense.

This measure shows how much the values in the sample vary, but it does not tell us anything about the distribution itself.

Variance

If the mean reflects the center of a random variable, the variance characterizes the spread of the data around that center and takes into account the influence of every value.

The formula for variance is the following:

$$ s^2 =\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } $$

where $# x_i #$ – observed values, $# \bar x #$ – mean value, n – number of values.

For each value, we take its deviation from the mean, square it, sum the squares, and then divide the sum by the number of values in the sample.

Why do we square it?

The sum of the raw deviations is always zero because negative and positive deviations cancel each other out. To avoid this mutual cancellation, the deviations in the numerator are squared. As for the denominator, the formula above divides by n. However, using values other than n improves the estimate in different ways. The most common choice is n - 1, which eliminates the bias of the estimate; this is the denominator used in the code below.
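The effect of the denominator can be seen on the second bar's data. This is a minimal sketch; the variable names are illustrative, and the comparison with the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1) is only there as a cross-check.

```python
import statistics

bar = [80, 40, 15, 25, 40]  # incomes in thousands of dollars

x_bar = sum(bar) / len(bar)
squared_deviations = sum((x - x_bar) ** 2 for x in bar)

biased = squared_deviations / len(bar)          # divide by n
unbiased = squared_deviations / (len(bar) - 1)  # divide by n - 1

print(biased)    # 490.0
print(unbiased)  # 612.5

# Cross-check against the standard library.
print(statistics.pvariance(bar) == biased)
print(statistics.variance(bar) == unbiased)
```

For a small sample like this one the two denominators give noticeably different results; the gap shrinks as n grows.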

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def variance(x):
    """sample variance (n - 1 denominator); assumes x has at least two elements"""
    n = len(x)
    # sum_of_squares already subtracts the mean, so pass the raw values
    return sum_of_squares(x) / (n - 1)

variance(data1)  # 0
variance(data2)  # 612.5

Thus, we take into account each deviation, and the sum of their squares divided by the number of objects (or by n - 1 for the unbiased estimate) gives us an estimate of variability.

What's the problem here?

Squaring changes the units of measurement. If we measure salaries in thousands of dollars, then the variance is expressed in thousands of dollars squared, and the numbers quickly grow into millions or even billions. This makes the variance hard to relate to the specific wages that people in the organization actually get.

Standard deviation

To bring the variance back to the original units, and so make it usable for more practical purposes, we take its square root. The result is the so-called standard deviation.

And that's the formula:

$$ s = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } } $$

import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(data1)  # 0
standard_deviation(data2)  # 24.7

Standard deviation also characterizes the measure of variability, but now (as opposed to variance) it can be compared with the original data because they have the same units of measurement (this is clear from the calculation formula).

For example, there is the three-sigma rule, which states that for normally distributed data about 997 values out of 1000 lie within ± 3 standard deviations of the mean. Standard deviation, as a measure of uncertainty, is also involved in many statistical calculations. It can be used to determine the accuracy of various estimates and forecasts. If the data vary widely, the standard deviation will be large, so a forecast based on them will be less accurate, which shows up, for example, in very wide confidence intervals.
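The 997-out-of-1000 figure can be reproduced from the normal distribution itself. This sketch relies on the identity that a normal variable falls within k standard deviations of its mean with probability erf(k / sqrt(2)); the function name `share_within` is an assumption made for illustration.

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

So roughly 68% of values fall within one standard deviation, 95% within two, and 99.7% within three.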

Conclusion

These are basic measures that a data engineer should know, but not all of them. They will be used in the next sections of the series, where some new measures will be introduced.

Data Science. Measures