
Data Science. Measures

Many engineers haven't had direct exposure to statistics or Data Science. Yet, when building data pipelines or translating Data Scientists' prototypes into robust, maintainable code, engineering complexities often arise. For Data/ML engineers and new Data Scientists, I've put together this series of posts.

I'll explain core Data Science approaches in simple terms, building from basic concepts to more complex ones.

In statistics, our goal isn't just to describe our data but to understand what it tells us about the larger group (population) it represents. To evaluate and describe data characteristics accurately, we need two things:

  1. Knowing which values are typical for the distribution.
  2. Understanding how typical these values are compared to other potential values.

The first part is addressed by measures of central tendency; the second, by measures of variation.

Measures of Central Tendency

To explore these central tendency measures, let's use a simple but relatable scenario:

Imagine a bar with five patrons, each making $40,000 a year. So, the average income in this bar is also $40,000.

data = [40, 40, 40, 40, 40]  # incomes in thousands of dollars

With this data, let's jump into the theory.

Mean

In probability theory, the mean (also known as the expected value or average) represents the center point of a set of values. Calculating the mean is a way to generalize a dataset into a single value, which often helps in decision-making. For example, knowing the average income in a company helps us budget for growth, and knowing the average check at a restaurant informs our spending choices.

The mean is a simple measure and has the following formula:

$$ mean = \sum\limits_{i = 1}^n {x_i p_i} $$

Here, $# x_i #$ represents the possible values of the random variable, and $# p_i #$ is the probability of each value.

As the formula shows, the mean of a random variable is a weighted sum of its values, with weights equal to the corresponding probabilities.

For example, if you calculate the mean of the sum of points when throwing two dice, you get the number 7. But here we know exactly all the possible values and their probabilities. What if there is no such information, only the results of some observations? The answer comes from statistics, which lets us estimate the mean, i.e. approximate it from the available data.
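
To make the weighted-sum formula concrete, here is a minimal sketch (not from the original post) that enumerates all 36 equally likely outcomes of two dice and computes the expected value exactly:

from itertools import product
from fractions import Fraction

# all 36 equally likely outcomes of two fair dice
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

# probability of each possible sum of points
probs = {s: Fraction(sums.count(s), 36) for s in set(sums)}

# the weighted sum of values: sum of x_i * p_i
sum(s * p for s, p in probs.items())  # Fraction(7, 1), i.e. exactly 7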

Mathematical statistics provides several ways to estimate the mean. The main one is the arithmetic mean, which has some useful properties: for example, it is an unbiased estimator, meaning its expected value equals the expectation it estimates.

In most cases, we don't have exact probabilities for each value, so we approximate using the arithmetic mean:

$$ mean = \frac{1}{n} \sum\limits_{i = 1}^n {x_i} $$

where $#x_i#$ represents values in the dataset, and n is the total number of values.

def mean(x):
    return sum(x) / len(x)

mean(data)  # 40

The mean, however, is highly sensitive to outliers. First, note that if one more patron with the same income joins our imaginary bar, the mean doesn't change.

data1 = data + [40]
mean(data1)  # 40

But what if Jeff Bezos walked in and we added an income of $10 million to our data? Suddenly, the average income skyrockets to $1.7 million, even though most people are still making $40,000.

data2 = data + [10000]
mean(data2)  # 1700

This is why we sometimes use other measures, like the truncated mean, mode, and median.
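
Of these, the truncated (trimmed) mean may be the least familiar: it discards a share of the smallest and largest values before averaging. The helper below is a hypothetical sketch, with the trim share p chosen purely for illustration:

def truncated_mean(x, p=0.1):
    """drops the lowest and highest p share of values, then averages the rest"""
    k = int(len(x) * p)
    trimmed = sorted(x)[k:len(x) - k] if k > 0 else sorted(x)
    return sum(trimmed) / len(trimmed)

truncated_mean(data2, p=0.2)  # 40.0 -- the $10M outlier is trimmed away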

Mode

Another simple measure: the mode is simply the most frequent value in a dataset. It is especially useful for data where one specific value or category is more common than the others.

There may be several modes, and the presence of multiple modes is itself a telling characteristic of the data: it indicates some internal structure, with subgroups that are qualitatively different from each other. In that case it may make sense not to look at the distribution as a whole but to split it into those subgroups and examine them separately.

from collections import Counter

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

mode(data)  # [40]
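
The multi-mode case described above is easy to demonstrate with a small hypothetical bimodal sample, say two distinct income groups in one bar:

mode([40, 40, 40, 15, 15, 15, 80])  # [40, 15] -- two subgroups, two modes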

The mode is indispensable for qualitative variables and of limited use for quantitative ones. It also helps us estimate the most typical value in a data sample.

Median

The central tendency can be viewed not only as the value with zero total deviation (the arithmetic mean) or the maximum frequency (the mode), but also as a mark that divides the ordered data into two equal parts: half of the values lie below this mark, and half lie above it. This is the median.

For datasets with outliers, the median is often more representative than the mean.

def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
    
median(data)  # 40
median(data2)  # 40

In a perfectly symmetrical distribution, the mean, median, and mode coincide. However, real-world data is rarely symmetrical, which is why these different measures can each be useful.
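
As a quick illustration with the functions we've built, here is a small hypothetical symmetric sample where all three measures agree:

sym = [10, 20, 30, 30, 40, 50]  # symmetric around 30
mean(sym)    # 30.0
median(sym)  # 30.0
mode(sym)    # [30]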

Measures of Variability (Dispersion)

To understand how the characteristics of a data sample behave, it is not enough to know the mean or the typical values; you also need to know their variability. That is, we must know not only what is typical but also how strongly the other values differ from it. For that, we have the measures of variation.

Let's get back to our imaginary situation. Imagine that we now have two bars:

data1 = [40, 40, 40, 40, 40]
data2 = [80, 40, 15, 25, 40]

mean(data1)  # 40
mean(data2)  # 40
median(data1)  # 40
median(data2)  # 40
mode(data1)  # [40]
mode(data2)  # [40]

By the measures we've looked at so far they seem identical, yet the data are clearly different.

Range

The range is the simplest measure of variability, calculated as the difference between the maximum and minimum values in the data. Although it's quick to calculate, the range can be heavily influenced by outliers.

def data_range(x):
    return max(x) - min(x)

data_range(data1)  # 0
data_range(data2)  # 65

On the one hand, the range can be informative and useful: for example, the difference between the maximum and minimum price of an apartment in a city, or between the maximum and minimum wages in a region. On the other hand, the range can be very large yet carry little practical meaning.

This measure shows how widely the values in the sample vary, but it tells us nothing about the distribution itself.

Variance

If the mean reflects the center of a random variable, the variance characterizes the spread of the data around that center, taking every value into account.

The formula for variance is the following:

$$ s^2 =\frac{1}{n - 1}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } $$

where $# x_i #$ are the individual values, $# \bar x #$ is the mean, and n is the number of values.

For each value, we take its deviation from the mean and square it; the sum of these squared deviations, divided by n - 1, gives the variance.

Why do we square the deviations?

Because the sum of the raw deviations is always zero: negative and positive deviations cancel each other out. Squaring avoids this mutual cancellation. As for the denominator: dividing by n gives the population variance, but for a sample this estimate is biased (slightly too small); dividing by n - 1 instead (Bessel's correction) eliminates the bias, which is what the formula and the code below do.

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    # sum_of_squares already subtracts the mean via de_mean
    return sum_of_squares(x) / (n - 1)  # n - 1 gives the unbiased estimate

variance(data1)  # 0
variance(data2)  # 612.5

Thus, we take every deviation into account, and the normalized sum of squares gives us an estimate of variability.
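
To see empirically why n - 1 helps, here is a minimal simulation sketch. It assumes samples of five rolls of a fair die, whose true variance is 35/12 ≈ 2.917, and compares a divide-by-n estimate against the variance function above:

import random

random.seed(42)

def biased_variance(x):
    """same as variance(), but divides by n instead of n - 1"""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** 2 for x_i in x) / n

samples = [[random.randint(1, 6) for _ in range(5)] for _ in range(100_000)]
mean([biased_variance(s) for s in samples])  # ~2.33 -- systematically too low
mean([variance(s) for s in samples])         # ~2.92 -- close to 35/12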

What's the problem here?

The squaring inflates the units of measurement. If we talk about salaries in thousands of dollars, the variance is expressed in thousands of dollars squared, so we suddenly operate in millions or even billions, which says little about the actual wages people in the organization get.

Standard deviation

To bring the variance back to reality, that is, to make it usable for practical purposes, we take its square root. The result is the so-called standard deviation.

And that's the formula:

$$ s = \sqrt {\frac{1}{n - 1}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } } $$

import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(data1)  # 0
standard_deviation(data2)  # 24.7

The standard deviation also characterizes variability, but unlike the variance it can be compared directly with the original data, because they share the same units of measurement (this is clear from the formula).

For example, the three-sigma rule states that for normally distributed data, about 997 values out of 1000 fall within ±3 standard deviations of the mean. The standard deviation, as a measure of uncertainty, is also involved in many statistical calculations: it can be used to determine the accuracy of various estimates and forecasts. If the spread in the data is very large, the standard deviation will be large too, and a forecast will be correspondingly inaccurate, expressed, for example, in very wide confidence intervals.
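
As a rough empirical check of the rule, here is a sketch assuming draws from a standard normal distribution via Python's random.gauss:

import random

random.seed(0)

sample = [random.gauss(0, 1) for _ in range(100_000)]
m = mean(sample)
s = standard_deviation(sample)
sum(1 for v in sample if abs(v - m) <= 3 * s) / len(sample)  # ~0.997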

Conclusion

These foundational concepts — mean, median, mode, variance, and standard deviation — are essential for data engineers, helping you understand data behavior at a high level. We'll build on these in future sections, where we'll tackle more complex measures and methods.




More? Well, there you go:

Data Science. Probability Distributions

Data Science. Descriptive and Inferential Statistics

Data Science. Exploratory Data Analysis