Data Science. Correlation

Many engineers have never worked in statistics or data science. Yet when they build data science pipelines, or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. This series of posts is for those Data/ML engineers and for novice data scientists. I'll try to explain some basic approaches in plain English and, building on them, some of the basic concepts of Data Science.

Correlation is a relationship or connection between two or more variables. Its essence is that when the value of one variable changes, the other variable tends to change as well (it decreases or increases).

When calculating correlations, we try to determine whether there is a statistically significant relationship between two or more variables in one or more data samples. For example, a relationship between height and weight, a relationship between performance and IQ test results, a relationship between experience and performance.

If you read a phrase in a newspaper like "it turned out that these events have such a correlation here", then in about 99% of cases, unless otherwise stated, we are talking about the Pearson correlation coefficient. This is the default correlation.

Pearson Correlation Coefficient


Let us assume that we have two processes and that we measure some parameter of each. As a result, we get a set of pairs of numbers (the parameter values of process 1 and process 2). If these processes are somehow related, we expect the relationship to show up numerically in the parameter values. Therefore, we should be able to get information about the presence or absence of a relation from the resulting set of pairs.

However, the relationship can be of different strength, so it is desirable to obtain not a binary "yes" or "no", but a kind of continuous number that will characterize the degree (strength) of the relationship between the two variables. And now, meet the Pearson correlation coefficient.

It can vary from -1 (negative correlation) to +1 (positive correlation). If the correlation coefficient is 0, there is no linear correlation between the variables. The closer the coefficient is to 1 (or -1), the stronger the correlation; the closer it is to 0, the weaker.

In the case of positive correlation, an increase (or decrease) in the values of one variable tends to go together with an increase (or decrease) in the values of the other. In the case of negative correlation, an increase (or decrease) in the values of one variable tends to go together with a decrease (or increase) in the values of the other.

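The formula for the Pearson correlation coefficient of two variables X and Y:

$$r_{XY} = \frac{cov}{\sigma_x \sigma_y}$$

where $$cov = M((X - M(X))(Y - M(Y)))$$ and $$\sigma_x$$, $$\sigma_y$$ are the standard deviations of X and Y.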

Honestly, I also don't like the mathematical version of the coefficient. A few scary little letters in scary combinations, all this (I hear you guys). The usual text with examples seems much more understandable. So I will try to stick with it.

There is a term, "mathematical expectation of a quantity", called expectation or mean for short (we talked about this in the previous post). In the simplest case, its meaning is extremely simple: it is the arithmetic average of all the values in a list $$X=\{x_1, x_2, ..., x_n\}$$:

$$M(X)=\frac{x_1 + x_2 + ... + x_n}{n}$$

Next, we find for each number in the list its deviation from the mean:

$$mx = M(X)$$

$$\{dx_1, dx_2, ..., dx_n\} = \{x_1 - mx, x_2 - mx, ..., x_n - mx\}$$

The same can be done with the values of the second variable measured simultaneously with the first variable:

$$Y=\{y_1, y_2, ..., y_n\}$$

$$my = M(Y)$$

$$\{dy_1, dy_2, ..., dy_n\} = \{y_1 - my, y_2 - my, ..., y_n - my\}$$

For example, say we took the temperature of every patient in the ward and also recorded how many aspirin pills each patient took per day. With the procedure described above, we can build a list of each patient's deviation from the average temperature in the ward. Applying the same procedure to the pills gives the corresponding list of deviations of each patient's number of pills from the average number of pills taken.
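To make the procedure concrete, here is a minimal sketch of it in Python. The patient numbers are made up for illustration, and the helper names `mean` and `deviations` are my own, not something from a library.

```python
# Hypothetical measurements for five patients (made-up numbers).
temperatures = [36.6, 37.2, 38.1, 36.9, 37.7]  # X: body temperature, °C
pills = [0, 1, 3, 1, 2]                        # Y: aspirin pills per day


def mean(values):
    """Arithmetic average M(X) of a list of numbers."""
    return sum(values) / len(values)


def deviations(values):
    """Deviation of every value from the mean of its own list."""
    m = mean(values)
    return [v - m for v in values]


dx = deviations(temperatures)  # dx_i = x_i - mx
dy = deviations(pills)         # dy_i = y_i - my

print("mx =", round(mean(temperatures), 2))
print("dx =", [round(d, 2) for d in dx])
print("my =", round(mean(pills), 2))
print("dy =", [round(d, 2) for d in dy])
```

A handy sanity check: the deviations of any list always sum to zero (up to floating-point rounding), because the mean sits exactly in the "middle" of the values.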

Therefore, we can assume that if the temperature is somehow related to the number of aspirin tablets taken, the deviations will be related too.

For example, if taking aspirin leads to an increase in temperature, we will see the following. If the patient has taken more pills than the others, his temperature deviates from the average upwards. If the patient has taken fewer pills than the others, his temperature deviates from the average downwards. So both deviations point in the same direction.

If we multiply the deviations in pairs, every product will be positive.

And vice versa — if taking aspirin lowers the temperature, then all products of deviations will be negative.

In other words, we have obtained a certain indicator with an interesting property — it is positive for the positive relation of phenomena and negative for negative relation.

$$D=\{dx_1 dy_1, dx_2 dy_2, ..., dx_n dy_n\}$$

But if the variables are not related, then with a large number of measurements the products should be split approximately evenly between positive and negative values (we will talk about distributions in the next post). If you average them, you'll probably get something around zero. Moreover, the more measurements, the closer to zero.

Yes, it is the mean again. So here it is, the criterion, covariance:

$$cov = M(D)$$
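As a rough sketch, covariance can be computed exactly as described: build the list D of pairwise products of deviations and average it. The data is made up, and the function names are mine. Note that this version divides by n, as in the formulas here; many libraries divide by n - 1 by default, so their numbers can differ slightly.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Mean of the pairwise products of deviations: cov = M(D)."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    mx, my = mean(xs), mean(ys)
    products = [(x - mx) * (y - my) for x, y in zip(xs, ys)]  # the list D
    return mean(products)


# Hypothetical ward data (made-up numbers).
temperatures = [36.6, 37.2, 38.1, 36.9, 37.7]
pills_up = [0, 1, 3, 1, 2]    # more pills go with a higher temperature
pills_down = [3, 2, 0, 2, 1]  # more pills go with a lower temperature

cov_up = covariance(temperatures, pills_up)
cov_down = covariance(temperatures, pills_down)

print("same direction:    ", round(cov_up, 4))
print("opposite direction:", round(cov_down, 4))
```

The sign behaves as promised: positive when the deviations point the same way, negative when they point in opposite directions.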

Covariance can take an arbitrary value, so in order to judge the strength of the relationship between the lists, you must also know the largest value it could reach for these variables. This is because the variables can have different scales.

To do this, we introduce another interesting value — derived from the list of squares of deviations from the mean.

$$DX^2 = \{dx_1^2, dx_2^2, ..., dx_n^2\}$$

$$\sigma_x = \sqrt{M(DX^2)}$$

It is called "standard deviation".
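Here is a small sketch of that definition, again on made-up temperatures. The function `std_dev` is my own name; the standard library's `statistics.pstdev` computes the same (population) quantity, so I use it as a cross-check. The Fahrenheit list is there only to show how the value depends on the scale of the variable.

```python
import math
import statistics


def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def std_dev(values):
    """Square root of the mean squared deviation from the mean."""
    m = mean(values)
    squared = [(v - m) ** 2 for v in values]  # the list DX^2
    return math.sqrt(mean(squared))


# Hypothetical temperatures, in Celsius and converted to Fahrenheit.
temperatures_c = [36.6, 37.2, 38.1, 36.9, 37.7]
temperatures_f = [c * 9 / 5 + 32 for c in temperatures_c]

sigma_c = std_dev(temperatures_c)
sigma_f = std_dev(temperatures_f)

print("sigma (°C):", round(sigma_c, 4))
print("sigma (°F):", round(sigma_f, 4))
print("pstdev    :", round(statistics.pstdev(temperatures_c), 4))
```

The same patients give a standard deviation 1.8 times larger in Fahrenheit. That is exactly the "different scales" problem: the number depends on the units, not only on the data.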

So, it can be shown that the covariance in its absolute value does not exceed the product of the standard deviations of these two lists.

$$|cov| ≤ \sigma_x * \sigma_y$$

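Well, since the standard deviation by construction is never negative, we can conclude that

$$-\sigma_x*\sigma_y ≤ cov ≤ \sigma_x*\sigma_y $$

So if we divide the covariance by the product of the standard deviations (assuming neither of them is zero), we get a number that always lies between -1 and +1. This is the Pearson correlation coefficient:

$$r_{XY} = \frac{cov}{\sigma_x * \sigma_y}$$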

In general, if you take this coefficient as a measure of relation, it will be very convenient and very universal: it will show whether the values are connected to each other and how much.
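Putting the pieces together, a minimal sketch of the coefficient could look like this. It is a plain-Python illustration on made-up data, not a replacement for a library routine; the function name `pearson` and the error handling for constant lists are my own choices.

```python
import math


def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    mx, my = mean(xs), mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = mean([a * b for a, b in zip(dx, dy)])
    sigma_x = math.sqrt(mean([a * a for a in dx]))
    sigma_y = math.sqrt(mean([b * b for b in dy]))
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (sigma_x * sigma_y)


# Hypothetical ward data (made-up numbers).
temperatures_c = [36.6, 37.2, 38.1, 36.9, 37.7]
temperatures_f = [c * 9 / 5 + 32 for c in temperatures_c]
pills = [0, 1, 3, 1, 2]

r_celsius = pearson(temperatures_c, pills)
r_fahrenheit = pearson(temperatures_f, pills)

print("r with °C:", round(r_celsius, 4))
print("r with °F:", round(r_fahrenheit, 4))
```

Unlike covariance, the result does not change when the temperatures are converted to another unit, which is the whole point of dividing by the standard deviations.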

Everything is very convenient and universal, but the reasoning above has many flaws. Given how universal the resulting "measure of relation" looks, it is very tempting to ignore them, yet they can lead to wrong conclusions.

Example #1

Once there was a student who decided to find out the connection between a patient's body temperature and how the patient feels. It's obvious that at a normal temperature of 36.6 °C the patient feels best. When the temperature rises, health gets worse. However, when the temperature drops, health also gets worse...

Student story

If the measurements to the left and right of the normal temperature are symmetrical (see figure), the correlation is close to zero. From this, the student concluded that there is no connection between temperature and well-being (ha!).
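The student's result is easy to reproduce with a sketch. Everything here is an assumption for illustration: the well-being score is an invented formula that peaks at 36.6 °C and drops symmetrically on both sides, and `pearson` is a hand-written helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical temperatures from 35.6 to 37.6 °C, symmetric around 36.6.
temps = [36.6 + k / 10 for k in range(-10, 11)]

# Made-up well-being score: best (10) at 36.6 °C, worse on both sides.
wellbeing = [10 - 5 * (t - 36.6) ** 2 for t in temps]

r = pearson(temps, wellbeing)
print(f"r = {r:.6f}")
```

Well-being here is completely determined by temperature, and yet the coefficient comes out at practically zero: the positive products on one side of 36.6 °C cancel the negative ones on the other side.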

The example shows that the "measure of association" we have introduced is not universal. The reasoning given in its construction refers only to linear dependencies.

Roughly speaking, it works only when the following dependence is observed:

$$y(x) = kx + b$$

If the dependence is different, then, in general, the correlation coefficient can be random. Yes, yes, not zero, but random.
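Here is a small sketch of what "random" means in practice. I take one and the same non-linear dependence, y = x², and compute the coefficient on three different ranges of x. The ranges and the helper names are my own choices for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


def correlation_with_square(xs):
    """Correlation between x and y = x**2 on the given sample of x."""
    return pearson(xs, [x * x for x in xs])


right_half = [i / 10 for i in range(0, 41)]    # x from 0 to 4
left_half = [-x for x in right_half]           # x from -4 to 0
full_range = [i / 10 for i in range(-40, 41)]  # x from -4 to 4

r_right = correlation_with_square(right_half)
r_left = correlation_with_square(left_half)
r_full = correlation_with_square(full_range)

print("x in [0, 4]:  ", round(r_right, 3))
print("x in [-4, 0]: ", round(r_left, 3))
print("x in [-4, 4]: ", round(r_full, 3))
```

The dependence never changes, but the coefficient is strongly positive, strongly negative, or about zero depending only on which part of the curve happened to be measured.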

In other words, a researcher somehow must know in advance that there is either a linear dependence or no association at all to draw more or less correct conclusions.

But how could he know that before the study?

Only from other studies. Where the analysis was done by other methods. For example, it could be based on a thoughtful study of charts or distributions.

But if all this is already done, why should he even consider the correlation? He already has more reliable results.

How good is such a measure, which gives a random number and therefore leads to completely different conclusions?

As a result, it turns out that the only thing correlation can be used for is a conclusion about the absence of a linear connection over the whole interval. And only about its absence: it cannot even be used to conclude that such a relationship exists.

The same goes for aspirin tablets if there is an optimum dose: at that dose the results are best, and with either less or more they are worse.

Example #2

There was a scientist once. Every morning, he saw the same person passing by his window. After a while, he noticed an amazing pattern — if this man had taken an umbrella with him, it was raining that day.

Of course, this did not happen every time: sometimes the man took an umbrella, but it did not rain, and sometimes he did not take it, but it still rained. But an umbrella with rain, and no umbrella with no rain, occurred together far too often.

The scientist, of course, did not simply trust his hypothesis (after all, he was a scientist), but instead began to carefully record his observations every day. After a while, he coded the events "with an umbrella" and "rain" as 1, the events "without an umbrella" and "without rain" as 0, and calculated the correlation on the records.

The correlation was very high: 0.95. These two events were connected!
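For the curious, here is a sketch of how such a calculation might look. The day counts below are invented, so the result will not match the scientist's 0.95; the point is only to show that the usual coefficient works on records coded as 0 and 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical notebook: (umbrella, rain) -> number of days observed.
day_counts = {
    (1, 1): 40,  # umbrella and rain
    (1, 0): 2,   # umbrella, no rain
    (0, 1): 3,   # no umbrella, rain
    (0, 0): 55,  # no umbrella, no rain
}

# Unroll the counts into two lists with one 0/1 record per day.
umbrella, rain = [], []
for (took_umbrella, rained), days in day_counts.items():
    umbrella += [took_umbrella] * days
    rain += [rained] * days

r = pearson(umbrella, rain)
print("days observed:", len(umbrella))
print("correlation:  ", round(r, 3))
```

A few mismatched days out of a hundred still leave a high coefficient, which is why a notebook like this looks so convincing.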

The proud scientist wrote an article entitled "How to Induce Rain" in which he convincingly proved that this guy controls the rain at his place of residence by carrying an umbrella.

After all, such a high correlation cannot be the result of chance.

Scientist story

If the correlation coefficient shows a real correlation of one variable with another (and it's not a random number like it was with the student), it still doesn't prove that one phenomenon was caused by another. It may be that the first phenomenon is not caused by the second and even the second phenomenon is not caused by the first.

In these cases, they say, "maybe there is a third phenomenon that causes both", but in some cases, this is more misleading than helpful. After all, it may indeed be a third phenomenon, but the options don't end there.

And, moreover, no one guarantees that the "third phenomenon" that someone suggested is the cause of these two.

Of course, there is no external mystical force in the above story. The guy with the umbrella was just watching the weather forecast in the morning, and if rain was promised, he would take an umbrella with him.

It may seem that, well, if not a mystical power, then there is still some factor that causes the first two phenomena.

Let's say there is. But which one?

Does the weather forecast make it rain?

Or does future rain cause the forecast?

Okay, not the rain itself, but do the air pressure, wind speed and direction, the presence of a lake, etc. cause both the rain and the forecast?

However, in the morning this person takes an umbrella only because he reads the published forecast; he does not know the upcoming weather. To check this, we can simply ask the source where he reads the forecasts to publish wrong ones. We will then see that the umbrella is taken depending on the forecast, and not based on physical parameters that indicate a high probability of rain.
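This thought experiment can be sketched as a tiny simulation. Everything in it is an assumption: rain on about 40% of days, a forecast that is right about 90% of the time, and a man who always follows whatever forecast is published. In the "experiment" the published forecast is deliberately inverted.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
days = 1000

# Assumed world: it rains on roughly 40% of days.
rain = [1 if rng.random() < 0.4 else 0 for _ in range(days)]
# Assumed forecast: agrees with the actual weather about 90% of the time.
forecast = [r if rng.random() < 0.9 else 1 - r for r in rain]

# Ordinary days: the man reads the honest forecast and follows it.
umbrella_observed = list(forecast)

# Experiment: the source publishes the opposite of its real forecast.
published = [1 - f for f in forecast]
umbrella_experiment = list(published)

r_observed = pearson(umbrella_observed, rain)
r_experiment = pearson(umbrella_experiment, rain)
r_follows = pearson(umbrella_experiment, published)

print("umbrella vs rain, honest forecast:  ", round(r_observed, 3))
print("umbrella vs rain, inverted forecast:", round(r_experiment, 3))
print("umbrella vs published forecast:     ", round(r_follows, 3))
```

Under these assumptions the umbrella stays perfectly tied to the published forecast, while its relation to the rain flips sign as soon as the forecast is tampered with. The weather itself did not change at all.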

That is, it is not one "third phenomenon" that causes the first two. Instead, there are a number of phenomena that are in very complicated relationships with each other.


The moral of these stories is that no single statistical indicator alone can confirm a theory that you like. Theories are only confirmed by a set of indicators in a properly constructed series of experiments. A series, not one experiment, even one with a lot of data.

Whatever statistic you calculate, it only gives you some basis for hypotheses and assumptions (look at how Nicolas Cage makes people die in a pool). For hypotheses, not for the "theories" that many people like to claim at the first stage.

Based on the results of the first experiment, it is necessary to formulate a hypothesis and in subsequent experiments to check whether it gives true predictions.

When those predictions come true in new experiments, the truth appears in the form of "statistical evidence". After all, the proposed relation of one phenomenon to another lets us make predictions about data that we did not yet have when the hypothesis was introduced. That is what demonstrates the reality of the connection, not just a high correlation.

Moreover, it is not enough to repeat the same experiment and make sure that it worked a second time. Whether it worked or not, it is necessary to check all this in other conditions as well. After all, a real theory cannot describe only one particular case; it should cover a fairly wide range of possible options.

But this is not the end. Even if the hypothesis gives correct predictions, it is still necessary to check, in further experiments, that the alternative hypotheses do not work. Otherwise, it turns out that you have not proved the correctness of your hypothesis, but only the correctness of a fairly extensive set of hypotheses, including yours.

That's the science of data.
