Data Science. Measures

There are a lot of engineers who have never been involved in statistics or data science. So, when they need to build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. For these Data/ML engineers and novice data scientists, I am writing this series of articles. I'll try to explain some basic approaches in plain English and, based on them, explain some of the data science model concepts.

The whole series:


In statistics, the data we observe is interesting not in itself, but as an estimate of the population it comes from. To assess and describe the distribution of a characteristic, we need to know two things. First, the values of the characteristic that are typical for the distribution under study. Second, how much the values differ from (are scattered around) those typical values.

The first task is solved by measures of central tendency, the second by measures of variation.

Measures of Central Tendency

To study the various measures of central tendency, we need some data. Let's frame it as a story.

Imagine that 5 people are sitting in a bar, each with an income of $40k. The average income of the people in the bar is $40k.

data = [40, 40, 40, 40, 40]

Using this data, we will now move on to the theory.

Mean

Probability theory is concerned with the study of random variables. For this purpose, various characteristics are calculated that describe their behavior. One of the main characteristics of a random variable is the mean (also called the mathematical expectation or the average), which is a kind of center of the data around which the values are grouped.

The mean gives us a generalized estimate and helps us make decisions. For example, the average income in a company gives a rough idea of how much a new employee can earn there, and the average restaurant check helps us decide whether it is worth going there or not.

The mean is an intuitive measure and has the following formula:

$$ mean = \sum\limits_{i = 1}^n {x_i p_i} $$

where $x_i$ – the possible values of the random variable, $p_i$ – their probabilities.

Hence, the mean of a random variable is a weighted sum of random variable values, where the weights are equal to the corresponding probabilities.

For example, if you calculate the mean value of the sum of points when throwing two dice, you get the number 7. But in that case we know all the possible values and their probabilities exactly. What if there is no such information, only the results of some observations? This is where statistics comes in: it allows us to obtain an approximate value of the mean, to estimate it from the available data.
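
As a quick sanity check of the dice example, here is a minimal sketch (the names are my own) that enumerates all 36 equally likely outcomes and applies the weighted-sum formula above:

from itertools import product

# all 36 equally likely outcomes of rolling two dice, each with p_i = 1/36
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
sum(s * (1 / 36) for s in sums)  # 7.0, up to floating-point rounding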

Mathematical statistics provides several options for estimating the mean value. The main one is the arithmetic mean, which has a number of useful properties. For example, the arithmetic mean is an unbiased estimate, i.e. its expected value equals the expectation being estimated.

So, the arithmetic mean is calculated by a formula known to any student:

$$ mean = \frac{1}{n} \sum\limits_{i = 1}^n {x_i} $$

where $x_i$ – the observed values, n – the number of values.

def mean(x):
	return sum(x) / len(x)

mean(data)  # 40.0

The disadvantage of this measure is its sensitivity to deviations and inhomogeneities in the sample; in other words, it is significantly distorted by outliers that deviate sharply from the center of the distribution. For distributions with large skewness, it may not correspond to the intuitive notion of the mean.

If another person with an income of $40k comes to our imaginary bar, the average income of the people in the bar will not change.

data1 = data + [40]
mean(data1)  # 40.0

But if Jeff Bezos walks into the bar with an income of 10,000k (as in the code below), the average income in the bar becomes 1,700k, although the incomes of the first five people have not changed.

data2 = data + [10000]
mean(data2)  # 1700.0

To deal with this problem, there are other measures of central tendency: the truncated mean, the mode, and the median (see the sketch of the truncated mean right below).
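
Since the truncated mean does not get its own section below, here is a minimal sketch of the idea (the function name and the proportion parameter are my own choices): sort the values, drop a fraction from each end, and average what remains.

def truncated_mean(x, proportion=0.1):
	"""drops the lowest and highest `proportion` of values, averages the rest"""
	n = len(x)
	k = int(n * proportion)  # how many values to drop from each end
	trimmed = sorted(x)[k:n - k]
	return sum(trimmed) / len(trimmed)

truncated_mean(data2, proportion=0.2)  # 40.0, the outlier is dropped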

Mode

Another intuitive measure. The mode is the most common value: simply the value that occurs most often in the data sample.

There can be several modes, and the presence of several modes is itself a characteristic of the data worth noticing. It is a sign that the data have some kind of internal structure, that there may be subgroups that differ qualitatively from each other. Maybe it then makes sense not to look at the distribution as a whole, but to divide it into subgroups and look at them separately.

from collections import Counter

def mode(x):
	"""returns a list, might be more than one mode"""
	counts = Counter(x)
	max_count = max(counts.values())
	return [x_i for x_i, count in counts.items() if count == max_count]

mode(data)  # [40]
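
And a quick check of the multi-mode case (these sample values are made up for illustration):

mode([40, 40, 15, 15, 80])  # [40, 15]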

The mode is irreplaceable for nominal variables and of little use for quantitative ones. It also helps us identify the most typical value in a data sample.

Median

The central tendency can be viewed not only as the value with zero total deviation (the arithmetic mean) or the maximum frequency (the mode), but also as a mark (a certain level of the analyzed indicator) that divides the ordered data (sorted in ascending or descending order) into two equal parts: half of the source data is less than this mark, and half is greater. This is the median.

Mode and median are important measures, they reflect the data structure and are sometimes used instead of the arithmetic mean.

def median(v):
	"""finds the 'middle-most' value of v"""
	n = len(v)
	sorted_v = sorted(v)
	midpoint = n // 2
	if n % 2 == 1:
		# if odd, return the middle value
		return sorted_v[midpoint]
	else:
		# if even, return the average of the middle values
		lo = midpoint - 1
		hi = midpoint
		return (sorted_v[lo] + sorted_v[hi]) / 2

median(data)  # 40
median(data2)  # 40.0

Obviously, with a symmetric distribution, the middle that divides the sample in half lies at the very center, in the same place as the arithmetic mean (and the mode). This is, so to speak, the ideal situation: the mode, the median, and the arithmetic mean coincide, and all their properties fall on one point, the maximum frequency, the division in half, the zero sum of deviations. However, life is not as symmetrical as the normal distribution.

Measures of Variability (Dispersion)

To understand how the characteristics of a data sample behave, it is not enough to know the mean and the typical values of these characteristics; you also need to know their variability. That is, we must know not only what is typical, but also how variable the values are, how strongly the values that do not resemble the typical one differ from the mean. For this, we have measures of variation.

Let's go back to our imaginary situation. Imagine that we now have two bars:

data1 = [40, 40, 40, 40, 40]
data2 = [80, 40, 15, 25, 40]

mean(data1)  # 40.0
mean(data2)  # 40.0
median(data1)  # 40
median(data2)  # 40
mode(data1)  # [40]
mode(data2)  # [40]

At first glance, the two bars look the same to an analyst: all three measures coincide. But the data is different.

Range

The simplest and most understandable measure of variability is the range.

Range is the distance between the minimum and maximum value of the characteristic.

def data_range(x):
	return max(x) - min(x)

data_range(data1)  # 0
data_range(data2)  # 65

On the one hand, the range can be quite informative and useful. For example, the maximum and minimum cost of an apartment in the city of N, the maximum and minimum salary in the region, and so on. On the other hand, the range can be very large and not have any practical meaning.

This measure says how much the values in the sample vary, but it says nothing about the distribution itself.
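
A small illustration (these sample values are made up): the two samples below share the same range, yet their values are spread very differently, which the measures introduced next will pick up.

data_a = [15, 40, 40, 40, 80]
data_b = [15, 15, 15, 80, 80]

data_range(data_a)  # 65
data_range(data_b)  # 65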

Variance

If the mean reflects the center of a random variable, then the variance characterizes the spread of the data around that center, and it takes into account the values of all objects.

The formula for variance is:

$$ s^2 =\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } $$

where $x_i$ – data value, $\bar x$ – mean value, n – number of objects.

Hence, for each object we take its deviation from the mean, square these deviations, sum them up, and then divide by the number of objects in the sample.

Why square?

The sum of the negative and positive deviations is zero, because deviations below the mean and deviations above it cancel each other out. To avoid this mutual cancellation, the deviations in the numerator are squared. As for the denominator: the formula above divides by n, but using other values improves the estimator in various ways. A common choice for the denominator is n − 1, which eliminates bias; this is what the code below uses.
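
A minimal simulation of this bias point (the sample size, trial count, and seed are my own choices): draw many small samples from a distribution with known variance 1 and compare the two denominators.

import random

random.seed(0)
n, trials = 5, 200_000
biased, unbiased = 0.0, 0.0
for _ in range(trials):
	xs = [random.gauss(0, 1) for _ in range(n)]
	m = sum(xs) / n
	ss = sum((x - m) ** 2 for x in xs)
	biased += ss / n          # divide by n
	unbiased += ss / (n - 1)  # divide by n - 1
print(biased / trials)    # ~0.8, underestimates the true variance of 1
print(unbiased / trials)  # ~1.0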

def de_mean(x):
	"""translate x by subtracting its mean (so the result has mean 0)"""
	x_bar = mean(x)
	return [x_i - x_bar for x_i in x]

def sum_of_squares(y):
	"""the total squared variation of y_i's from their mean"""
	return sum(v ** 2 for v in de_mean(y))

def variance(x):
	"""assumes x has at least two elements"""
	n = len(x)
	return sum_of_squares(x) / (n - 1)

variance(data1)  # 0.0
variance(data2)  # 612.5
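
To see the cancellation mentioned above (why raw deviations cannot measure spread), note that the deviations always sum to zero:

sum(de_mean(data2))  # 0.0, positive and negative deviations cancel out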

Thus, we take into account each deviation, and the sum divided by the number of objects gives us an estimate of the variability.

What are the problems here?

The squaring changes the scale of the measurement. If we talk about salaries in thousands of dollars, then after squaring we begin to operate in millions or even billions of squared dollars. This is reasonable from the point of view of the math, but not informative with respect to the actual salaries people in the organization get.

Standard deviation

To bring the dispersion back to reality, that is, to make it usable for more practical purposes, we take its square root. The result is the so-called standard deviation. Its formula is:

$$ s = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^n {\left( {x_i - \bar x} \right)^2 } } $$

import math

def standard_deviation(x):
	return math.sqrt(variance(x))

standard_deviation(data1)  # 0.0
standard_deviation(data2)  # 24.75

Obviously, the standard deviation also characterizes variability, but unlike the variance it can be compared with the original data, since they have the same units of measurement (this is clear from the formula). Still, this measure in its pure form is not very intuitive, since it hides several intermediate calculations (deviation, squaring, summation, averaging, root). Nevertheless, it is possible to work directly with the standard deviation, because its properties are well studied and known.

For example, there is the three-sigma rule, which states that for normally distributed data, 997 values out of 1000 fall within ±3 sigma of the arithmetic mean. The standard deviation, as a measure of uncertainty, is also involved in many statistical calculations. It is used to establish the degree of accuracy of various estimates and forecasts. If the variation is very large, the standard deviation will also be large, and the forecast will therefore be inaccurate, which shows up, for example, as very wide confidence intervals.
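
A quick empirical check of the three-sigma rule (the sample size and seed are my own choices): sample from a standard normal distribution and count the share of values within three standard deviations of the mean.

import random

random.seed(42)
sample = [random.gauss(0, 1) for _ in range(100_000)]
sum(1 for v in sample if abs(v) <= 3) / len(sample)  # ~0.997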

Conclusion

These are the basic measures a data engineer should know, though not all of them. They will be used in the EDA section of the series, where some new measures will be introduced.
