Data Science. Correlation

#data-science

Many engineers haven't had direct exposure to statistics or Data Science. Yet, when building data pipelines or translating Data Scientist prototypes into robust, maintainable code, engineering complexities often arise. For Data/ML engineers and new Data Scientists, I've put together this series of posts.

I'll explain core Data Science approaches in simple terms, building from basic concepts to more complex ones.

Correlation is a relationship or connection between two or more variables. Its essence is that when one variable's value changes, the other variable's value tends to change as well (either increasing or decreasing).

When calculating correlations, we aim to determine whether there is a statistically significant relationship between two or more variables in one or more data samples. For instance, we might explore the relationship between height and weight, performance and IQ test results, or experience and productivity.

If you ever read a phrase in a newspaper like "it turned out that these events have such a correlation here", there's a 99% chance (unless stated otherwise) that it refers to the Pearson correlation coefficient. This is the default and most commonly referenced measure of correlation.

Pearson Correlation Coefficient

Imagine we have two processes, each with a set of measured parameters. The result is a set of paired numbers representing the parameter values of process 1 and process 2. Assuming a relationship between these processes, we expect this relationship to reveal itself through the parameter values. Instead of a binary "yes" or "no" answer, we prefer a continuous value representing the degree (strength) of the relationship between the two variables. And this is where the Pearson correlation coefficient comes into play.

The Pearson correlation coefficient ranges from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship). A value of 0 means there's no linear correlation between the variables. The closer the coefficient is to 1 or -1, the stronger the correlation; the nearer it is to 0, the weaker.

In a positive correlation, an increase (or decrease) in one variable generally goes together with an increase (or decrease) in the other.
In a negative correlation, an increase (or decrease) in one variable usually goes together with a decrease (or increase) in the other.

The formula for the Pearson correlation coefficient between two variables X and Y is:

$$ r=\frac{cov}{\sigma_x\sigma_y} $$

where $$cov = M((X - M(X))(Y - M(Y)))$$.

Honestly, I also don't like the mathematical version of the coefficient. A few scary little letters in scary combinations, all this (I hear you guys). The usual text with examples seems much more understandable. So I will try to stick with it.

There is a term, "mathematical expectation of a quantity", called expectation or mean for short (we talked about this in the previous post). In the simplest case, its meaning is straightforward: it is the arithmetic average of all values:

$$ M(X)=\frac{(x_1 + x_2 + ... + x_n)}{n} $$

We then find each value's deviation from the mean:

Calculate the mean $#mx = M(X)#$
Compute deviations $#dx_i = x_i - mx#$

The same process applies to the values of the second variable:

$$ Y = \{y_1, y_2, \dots, y_n\} $$

with mean $#my = M(Y)#$ and deviations $#dy_i = y_i - my#$.
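The two steps above (mean, then deviations) can be sketched in a few lines of Python. This is only an illustration: the function names `mean` and `deviations` and the sample numbers are made up for this example.

```python
def mean(values):
    """Arithmetic average M(X) of a non-empty list of numbers."""
    return sum(values) / len(values)


def deviations(values):
    """Deviation of each value from the list's mean: dx_i = x_i - mx."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical measurements, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

dx = deviations(x)
dy = deviations(y)
print(dx)  # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(dy)  # [-4.0, -2.0, 0.0, 2.0, 4.0]
```

Note that the deviations of each list always sum to zero, which is why they can't be averaged directly to measure anything.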

For example, suppose we took the temperature of every patient in the ward and also recorded how many aspirin pills each patient took per day. With the procedure described above, we can build a list of each patient's deviation from the ward's average temperature, and a corresponding list of each patient's deviation from the average number of pills taken.

We can then assume that if temperature is somehow related to the number of aspirin tablets taken, the two lists of deviations will be related as well.

For example, if taking aspirin raises temperature, we will see the following. A patient who took more pills than average has a temperature that deviates upwards from the average, and a patient who took fewer pills has a temperature that deviates downwards. In both cases the two deviations point in the same direction, so multiplying each pair of deviations gives a positive value.

Conversely, if aspirin lowers temperature, the two deviations point in opposite directions and their product is negative.

In other words, we obtain an indicator that is positive for a positive relationship and negative for a negative relationship.

$$ D = \{dx_1 dy_1, dx_2 dy_2, \dots, dx_n dy_n\} $$

When the variables are unrelated, a large sample contains roughly as many positive products of deviations as negative ones, so they largely cancel out and their average is close to zero. The more measurements we have, the closer this average gets to zero.
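A small simulation can make this cancellation visible. The sketch below is an assumption-laden illustration, not part of the original argument: it draws two independent lists of uniform random numbers with a fixed seed and averages the products of their deviations. The helper name `mean_product_of_deviations` is made up for this example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mean_product_of_deviations(xs, ys):
    """Average of dx_i * dy_i over all pairs."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


rng = random.Random(42)  # fixed seed so the run is repeatable

results = {}
for n in (100, 100_000):
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]  # generated independently of xs
    results[n] = mean_product_of_deviations(xs, ys)
    print(n, results[n])
```

With independent inputs, both averages should be small, and the one from the larger sample will typically be much closer to zero.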

To encapsulate this concept, we define covariance:

$$ cov = M(D) $$
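As a rough sketch, covariance defined this way is just the mean of the products of deviations. The pill counts and temperatures below are hypothetical numbers chosen so the arithmetic is easy to follow; they are not real patient data.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """cov = M(D), where D is the list of products dx_i * dy_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    mx, my = mean(xs), mean(ys)
    products = [(x - mx) * (y - my) for x, y in zip(xs, ys)]
    return mean(products)


# Hypothetical ward: pills taken per day and measured temperature.
pills = [0, 1, 2, 3, 4]
temp_up = [36.0, 36.5, 37.0, 37.5, 38.0]    # temperature rises with pills
temp_down = [38.0, 37.5, 37.0, 36.5, 36.0]  # temperature falls with pills

print(covariance(pills, temp_up))    # 1.0
print(covariance(pills, temp_down))  # -1.0
```

The sign matches the direction of the relationship, but the magnitude depends on the units, which is the problem the next step addresses.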

Covariance depends on the units and spread of the variables, so its raw value is hard to interpret on its own. To account for the scale of the variables, we introduce the standard deviation, which is derived from the list of squared deviations from the mean:

$$ DX^2 = \{dx_1^2, dx_2^2, \dots, dx_n^2\} $$

$$ \sigma_x = \sqrt{M(DX^2)} $$
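A minimal sketch of this definition follows. Since the formula averages over all n values, it corresponds to the population standard deviation, which Python's standard library exposes as `statistics.pstdev`; the sample values are made up for illustration.

```python
import math
import statistics


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """sigma_x = sqrt(M(DX^2)): root of the mean squared deviation."""
    m = mean(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return math.sqrt(mean(squared_deviations))


# Hypothetical values with mean 5 and mean squared deviation 4.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(std_dev(values))            # 2.0
print(statistics.pstdev(values))  # same quantity from the standard library
```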

Covariance, in absolute terms, does not exceed the product of the standard deviations of the two lists:

$$ |cov| \leq \sigma_x \sigma_y $$

Since standard deviation is non-negative by definition, this is the same as:

$$ -\sigma_x \sigma_y \leq cov \leq \sigma_x \sigma_y $$

Thus, we can normalize covariance to derive the Pearson correlation coefficient:

$$ r = \frac{cov}{\sigma_x \sigma_y} $$

which gives us a value between -1 and 1, indicating both the direction and strength of the relationship between the variables.
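Putting the pieces together, the whole calculation fits in one short function. This is a sketch under stated assumptions rather than a reference implementation: the name `pearson` and the height and weight figures are invented for the example, and the function refuses constant inputs because the coefficient is undefined when either standard deviation is zero.

```python
import math


def pearson(xs, ys):
    """Pearson r = cov / (sigma_x * sigma_y) for two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx = sum(xs) / n
    my = sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / (sigma_x * sigma_y)


# Hypothetical heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r = pearson(heights, weights)
print(round(r, 3))  # close to, but below, 1
```

In practice you would normally reach for an existing library routine; writing it out by hand is only meant to show that nothing beyond means and deviations is involved.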

In general, using this coefficient as a measure of correlation is very convenient and universal: it shows whether the values are connected and to what degree.

However, while convenient, the Pearson correlation coefficient has limitations that can lead to incorrect conclusions.

Example #1

Once, a student decided to investigate the relationship between a patient's body temperature and their well-being. It seems obvious that people feel their best at a normal temperature of 36.6 °C. As the temperature rises, well-being declines, and the same happens when it drops.

[Figure: Student story]

If the observations are spread symmetrically around the normal temperature (as shown in the figure), the correlation coefficient is close to zero. The student concluded, erroneously, that temperature and well-being are unrelated.
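The student's result is easy to reproduce. In the sketch below, the temperatures and well-being scores are hypothetical numbers invented for illustration, arranged symmetrically around 36.6 °C; `pearson` is a hand-written helper, not a library function.

```python
import math


def pearson(xs, ys):
    """Pearson r computed from means and deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical data: well-being (0-10) peaks at 36.6 and drops on both sides.
temps = [35.6, 36.1, 36.6, 37.1, 37.6]
well_being = [6, 9, 10, 9, 6]

r_all = pearson(temps, well_being)
r_fever = pearson(temps[2:], well_being[2:])  # only 36.6 and above

print(r_all)    # essentially zero, despite the obvious dependence
print(r_fever)  # strongly negative on the fever side alone
```

The dependence is perfectly clear on each side of the peak; it only disappears from the coefficient when both sides are pooled together.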

Limitations of the Pearson Correlation

The example highlights that the "measure of association" here is not universal. The reasoning in constructing the Pearson correlation only holds for linear dependencies.

In simple terms, it works only for relationships like:

$$ y(x) = kx + b $$

If the relationship has a different shape, the coefficient can take almost any value, regardless of how strongly the variables actually depend on each other. In other words, the researcher must know in advance that there is either a linear relationship or no association at all to interpret the coefficient correctly.
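To see how much the shape of the relationship matters, the sketch below compares a few fully deterministic dependencies. All the data is synthetic and chosen for illustration; `pearson` is a hand-written helper. In every case y is completely determined by x, yet the coefficient varies widely.

```python
import math


def pearson(xs, ys):
    """Pearson r computed from means and deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (sigma_x * sigma_y)


x_pos = list(range(10))     # 0 .. 9
x_sym = list(range(-4, 5))  # -4 .. 4, symmetric around zero

r_line_up = pearson(x_pos, [3 * v for v in x_pos])           # rising line
r_line_down = pearson(x_pos, [-0.5 * v + 7 for v in x_pos])  # falling line with offset
r_square_pos = pearson(x_pos, [v * v for v in x_pos])        # parabola, one branch
r_square_sym = pearson(x_sym, [v * v for v in x_sym])        # parabola, both branches

print(r_line_up, r_line_down)      # approximately 1 and -1
print(r_square_pos, r_square_sym)  # high on one branch, about 0 on both
```

The same quadratic rule produces a high coefficient or a near-zero one depending only on which range of x happened to be sampled.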

But how could the researcher know this beforehand?

Typically, only by referencing prior studies, possibly based on charts or distributions. But if this detailed analysis is already available, why calculate the correlation? They likely already have more reliable information.

So, how reliable is a measure that can yield seemingly arbitrary numbers and potentially lead to contradictory conclusions?

In essence, the only reasonably safe conclusion to draw from a correlation coefficient concerns the absence of a linear relationship across the entire dataset, when the value is near zero. A high value on its own doesn't let us confidently conclude that such a relationship exists.

The aspirin example is similar. With an optimal dose, the best results are achieved. But with too little or too much aspirin, results worsen—yet correlation alone would not reveal this quadratic relationship.

Example #2

[Figure: Scientist story]

There was once a scientist who, every morning, noticed the same man passing by his window. Soon, he observed an interesting pattern: whenever this man carried an umbrella, it would rain that day.

Of course, this wasn't always the case—sometimes the man brought an umbrella, but it didn't rain; other times, he went without one and still got soaked. But overall, there seemed to be a strong association between the presence of an umbrella and rain.

The scientist, skeptical yet intrigued, began systematically recording his observations. Over time, he noted "umbrella" and "rain" events as 1, and "no umbrella" and "no rain" as 0, then calculated the correlation from these records.

The correlation was high—0.95! Clearly, these events were connected.

Thrilled, the scientist published a paper titled "How to Induce Rain," in which he confidently argued that this man controlled the local weather by choosing to carry an umbrella. After all, such a strong correlation couldn't just be coincidence, right?
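The scientist's calculation itself is simple to sketch. The 20 days of records below are hypothetical and invented for this example (they are not the scientist's data and will not reproduce his 0.95); `pearson` is a hand-written helper. Applied to 0/1 records, the Pearson formula works exactly as it does for any other numbers.

```python
import math


def pearson(xs, ys):
    """Pearson r computed from means and deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical diary: 1 = umbrella / rain, 0 = no umbrella / no rain.
# Day 8: umbrella but no rain. Day 9: rain but no umbrella.
umbrella = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rain     = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

r = pearson(umbrella, rain)
print(round(r, 2))  # a high positive value
```

Nothing in this calculation knows or cares which event came first or why; swapping the two lists gives exactly the same number.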

Correlation is not Causation

While a high correlation coefficient may suggest a relationship between two variables, it doesn't prove causation. A correlation can arise even when the first phenomenon doesn't cause the second and the second doesn't cause the first.

Often, we invoke the "third variable" argument, suggesting that some other factor is driving both phenomena. But even this approach can be misleading, as it's not always just one external factor at play. And even if we do identify a "third phenomenon," there's no guarantee that it is truly the cause of the observed correlation.

In this case, there's no mystical force behind the man and his umbrella. The man simply checked the weather forecast each morning. If rain was predicted, he brought an umbrella.

It might be tempting to think that some external factor directly links the man's umbrella to the weather. But then, what exactly would that be?

Is it the forecast causing the rain?

Or is it future rain that somehow influences the forecast?

Or maybe it's not the rain, but atmospheric pressure, wind, or proximity to a lake that drives both the forecast and the rain?

The reality is simpler. The man brought an umbrella based on the forecast, not because he possessed any knowledge of upcoming weather. To confirm this, we could even have the weather station provide a false forecast one day and see if he brings his umbrella accordingly. If he does, it proves his decision depends on the forecast and not on physical indicators like barometric pressure that correlate with rain probability.

In short, there's no single "third phenomenon" causing the umbrella-carrying behavior and the rain. Instead, it's a web of interrelated events, where human behavior and forecasts influence outcomes in ways that defy a straightforward cause-effect explanation.

Conclusion

The lesson here is that no single statistical measure can confirm a theory simply because it supports a hypothesis you like. Theories need confirmation through a series of carefully designed experiments with multiple indicators—not just one study, even if it's large.

Statistical calculations offer a foundation for hypotheses, not full-fledged "theories" (and as a side note, check out how Nicolas Cage "causes" pool deaths). A hypothesis, not a theory, is what emerges from initial analysis. The next step is testing this hypothesis to see if it makes accurate predictions.

[Figure: Nicolas Cage "causes" pool deaths]

Only when predictions consistently hold true across new experiments does a hypothesis gain "statistical evidence". This evidence emerges when a relationship between phenomena can predict outcomes on data that wasn't available at the time the hypothesis was formed. This is what ultimately demonstrates the existence of a real connection—not merely a high correlation.

Even if a hypothesis shows consistent predictive power, one successful experiment isn't enough. We must replicate it in varied conditions. After all, a theory shouldn't explain a single case but should account for a broad spectrum of scenarios.

Finally, if a hypothesis consistently generates correct predictions, we still need to test it against other plausible explanations. Otherwise, you're not proving the uniqueness of your hypothesis but rather confirming the validity of a broad set of possibilities, one of which just happens to be your hypothesis.

That's the science of Data.