
Data Science. Probability distributions


There are a lot of engineers who have never been involved in statistics or data science. So, when they need to build data science pipelines or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. For these Data/ML engineers and novice data scientists, I am writing this series of articles. I'll try to explain some basic approaches in plain English and, based on them, explain some of the Data Science model concepts.



Numerical data types

These are numerical values that it makes sense to add, subtract, average, and so on.

  1. Continuous. A number within a range of values, usually measured, such as height (within the range of human heights).
  2. Discrete. Only take certain values (can’t be decimal), usually counted, such as the count of students in a class.

There are many distributions, but here we will talk about the most common and widely used ones. First, though, we need to understand the plots, and the common plots for distributions are the pdf and the cdf.

Probability Density Function (pdf)

Consider an experiment in which the probabilities of events are as follows: the probabilities of getting the numbers 1, 2, 3, 4 are 1/10, 2/10, 3/10, 4/10 respectively. It will be more convenient if we have an equation for this experiment that gives these values based on the events. For example, the equation for this experiment can be given by f(x) = x/10, where x = 1, 2, 3, 4. This equation (equivalently, a function) is called the probability distribution function, although some authors also refer to it as the probability function, the frequency function, or the probability mass function. It tells us the probability of the random variable x taking a particular value.

Cumulative Distribution Function (cdf)

The cumulative distribution function provides a cumulative picture of the probability distribution. As the name suggests, it is simply the probability of all values up to a particular value of the random variable. In the example above, given x = 3, the cdf is the sum of the probabilities of all values from 1 to 3.
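A quick sketch of this (not from the original example, just the f(x) = x/10 case above tabulated directly):

import numpy as np

# The example above: f(x) = x/10 for x = 1, 2, 3, 4
x = np.array([1, 2, 3, 4])
pdf = x / 10                 # probability of each individual outcome
cdf = np.cumsum(pdf)         # running sum: P(X <= x)

for xi, p, c in zip(x, pdf, cdf):
    print(f"x={xi}  pdf={p:.1f}  cdf={c:.1f}")
# cdf at x=3 is 0.1 + 0.2 + 0.3 = 0.6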

Continuous distributions

In this section, as the title suggests, we are going to investigate probability distributions of continuous random variables, that is, random variables whose support contains an infinite interval of possible outcomes.

Uniform Distribution

from scipy.stats import uniform
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1, figsize=(15,15))
# calculate a few first moments:
#mean, var, skew, kurt = uniform.stats(moments='mvsk')
# display the probability density function (``pdf``):
x = np.linspace(uniform.ppf(0.01), uniform.ppf(0.99), 100)
ax.plot(x, uniform.pdf(x),'r-', lw=5, alpha=0.6, label='uniform pdf')
ax.plot(x, uniform.cdf(x),'b-', lw=5, alpha=0.6, label='uniform cdf')
# Check accuracy of ``cdf`` and ``ppf``:
vals = uniform.ppf([0.001, 0.5, 0.999])
np.allclose([0.001, 0.5, 0.999], uniform.cdf(vals))

# generate random numbers:
r = uniform.rvs(size=1000)
# and compare the histogram:
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()

In statistics, a uniform distribution is a type of probability distribution in which all outcomes are equally likely; each value has the same probability of being the outcome. A deck of cards gives a uniform distribution because drawing a heart, a club, a diamond or a spade is equally likely. A coin also gives a uniform distribution because the probabilities of getting heads or tails in a coin toss are the same.

More formally, this is the distribution of a random variable that can take any value in the interval (a, b), where the probability of falling in any segment inside (a, b) is proportional to the length of the segment and does not depend on its position, and the probability of values outside the interval (a, b) is equal to 0.

So, a continuous random variable x has a uniform distribution, denoted U(a, b), if its probability density function is: $$ f(x)=\dfrac{1}{b-a}, \quad a \le x \le b $$ and f(x) = 0 otherwise.
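In scipy.stats the same U(a, b) density can be obtained through the loc/scale parametrization (loc=a, scale=b-a). The snippet below is a minimal sketch with an assumed interval a=2, b=5, chosen only for illustration:

from scipy.stats import uniform

a, b = 2, 5                          # assumed interval, for illustration only
u = uniform(loc=a, scale=b - a)      # U(a, b) in scipy's loc/scale form

print(u.pdf(3))      # 1 / (b - a) = 1/3 everywhere inside (a, b)
print(u.pdf(7))      # 0.0 outside the interval
print(u.cdf(3.5))    # (3.5 - a) / (b - a) = 0.5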

Normal Distribution

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
# calculate a few first moments:
mean, var, skew, kurt = norm.stats(moments='mvsk')
# display the probability density function (``pdf``):
x = np.linspace(norm.ppf(0.01),  norm.ppf(0.99), 100)
ax.plot(x, norm.pdf(x),
	   'r-', lw=5, alpha=0.6, label='norm pdf')
ax.plot(x, norm.cdf(x),
	   'b-', lw=5, alpha=0.6, label='norm cdf')
# check accuracy of ``cdf`` and ``ppf``:
vals = norm.ppf([0.001, 0.5, 0.999])
np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))

# generate random numbers:
r = norm.rvs(size=1000)
# and compare the histogram:
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()

At the center of statistics lies the normal distribution, known to millions of people as the bell curve, or the bell-shaped curve. This is actually a two-parameter family of curves that are graphs of the probability density functions:

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

It looks a bit scary, but we'll work through it all. There are two mathematical constants in the density function of the normal distribution:

  • π — the ratio of a circle's circumference to its diameter, approximately 3.142;
  • e — the base of the natural logarithm, approximately 2.718.

And two parameters that set the shape of a particular curve:

  • µ — the mathematical expectation, or mean. Data near the mean occur more frequently than data far from the mean;
  • $σ^2$ — the variance, which will also be discussed in the following articles.

And, of course, there is the variable x itself, for which the value of the function, i.e. the probability density, is calculated.

The constants, of course, do not change. The parameters are what give a specific normal distribution its final shape.

So, the specific form of the normal distribution depends on two parameters: the expectation (µ) and the variance ($σ^2$), briefly denoted $N(\mu, σ^2)$. The parameter µ (the expectation) determines the center of the distribution, which corresponds to the maximum height of the graph. The variance $σ^2$ characterizes the range of variation, that is, the "spread" of the data.
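To see how the two parameters shape the curve, here is a minimal sketch (the (µ, σ) pairs are arbitrary, picked only for illustration) that plots several N(µ, σ²) densities using scipy's loc (mean) and scale (standard deviation σ) arguments:

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-6, 6, 200)
# (mu, sigma) pairs chosen arbitrarily for illustration
for mu, sigma in [(0, 1), (0, 2), (2, 1)]:
    plt.plot(x, norm.pdf(x, loc=mu, scale=sigma),
             label=f'mu={mu}, sigma={sigma}')
plt.legend()
plt.show()

Increasing σ flattens and widens the curve, while changing µ only shifts it along the x axis.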

Why is this distribution so popular?

The importance of the normal distributions stems primarily from the fact that the distributions of many natural phenomena are at least approximately normally distributed. One of the first applications of the normal distribution was to the analysis of errors of measurement made in astronomical observations, errors that occurred because of imperfect instruments and imperfect observers. Galileo in the 17th century noted that these errors were symmetric and that small errors occurred more frequently than large errors. This led to several hypothesized distributions of errors, but it was not until the early 19th century that it was discovered that these errors followed a normal distribution. Independently, the mathematicians Adrain in 1808 and Gauss in 1809 developed the formula for the normal distribution and showed that errors were fit well by this distribution.

Discrete distributions

Bernoulli distribution (binomial distribution)

from scipy.stats import bernoulli
import seaborn as sb
import matplotlib.pyplot as plt

# generate 1000 Bernoulli trials with success probability p=0.6
data_bern = bernoulli.rvs(size=1000, p=0.6)
ax = sb.distplot(
	data_bern,
	kde=True,
	color='b',
	hist_kws={'alpha': 1},
	kde_kws={"color": "r", "lw": 3, "label": "KDE"})
ax.set(xlabel='Bernoulli', ylabel='Frequency')
plt.show()

Not every phenomenon is measured on a quantitative scale like 1, 2, 3 ... 100500. A phenomenon cannot always take on an infinite or even a large number of different states. For example, a person's sex is either male or female. A shooter either hits the target or misses. You can vote either "for" or "against", and so on. In other words, such data reflect the state of an alternative (binary) feature: the event either occurred or it did not. The occurrence of the event (a positive outcome) is also called a "success". Such phenomena can also be mass and random, therefore they can be measured and used to draw statistically valid conclusions.

Experiments with such data follow the Bernoulli scheme, named in honor of the famous Swiss mathematician who found that with a large number of trials, the ratio of positive outcomes to the total number of trials tends to the probability of occurrence of this event. The probability of observing exactly x successes in a series of n such trials is given by the binomial formula:

$$ f(x)=\dbinom{n}{x} p^x (1-p)^{n-x} $$

  • n — the number of trials in the series;
  • x — the random variable (the number of occurrences of event A);
  • p — the probability that event A occurs in a single trial;
  • q = 1 - p — the probability that A does not occur in a single trial.
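As a small sketch (n = 10 and p = 0.6 are assumed values, used only for illustration), scipy.stats.binom evaluates this formula directly, and the Bernoulli distribution is simply the single-trial (n = 1) special case:

from scipy.stats import binom, bernoulli

n, p = 10, 0.6                  # assumed values for illustration
print(binom.pmf(4, n, p))       # P(exactly 4 successes in 10 trials)
print(binom.cdf(4, n, p))       # P(at most 4 successes)

# The Bernoulli distribution is the n = 1 special case
print(bernoulli.pmf(1, p))      # 0.6
print(binom.pmf(1, 1, p))       # same value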

Poisson Distribution

from scipy.stats import poisson
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(15,15))


data_poisson = poisson.rvs(mu=4, size=10000)

ax = sb.distplot(data_poisson, kde=True, color='b',
				 bins=np.arange(data_poisson.min(), data_poisson.max() + 1),
				 kde_kws={"color": "r", "lw": 3, "label": "KDE"})
ax.set(xlabel='Poisson', ylabel='Frequency')
plt.show()

The Poisson distribution is obtained as a limiting case of the binomial distribution if we let p tend to zero and n to infinity so that their product remains constant: np = λ. Formally, this transition leads to the formula

$$ f(x) = \frac{{e^{ - \lambda } \lambda ^x }}{{x!}} $$

  • x is a random variable (the number of occurrences of event A);
  • $\lambda$ — the average number of events in an interval, also called the event rate or rate parameter; it equals both the mean and the variance of the distribution.
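For example, with an assumed rate of λ = 4 events per interval (the same value passed as mu in the code above), scipy.stats.poisson gives the probability of observing a given count:

from scipy.stats import poisson

lam = 4                         # event rate lambda (called mu in scipy's API)
print(poisson.pmf(2, lam))      # P(exactly 2 events in the interval)
print(poisson.cdf(2, lam))      # P(at most 2 events)
print(poisson.mean(lam), poisson.var(lam))   # both equal lambda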

Very many random variables encountered in science and everyday life follow the Poisson distribution: the number of equipment breakdowns, the number of printing errors on a page, the number of goals scored by a football team, and so on.

Conclusion

There are a lot of theoretical distributions: normal, Poisson, Student's, Fisher's, binomial, etc. Each of them was designed to analyze data of a particular origin with certain characteristics. In practice, these distributions are used as a kind of template for analyzing real data of a similar type. In other words, they try to impose the structure of the chosen theoretical distribution on the real data, thereby calculating the probabilities of interest to the analyst.

More strictly speaking, theoretical distributions are probabilistic-statistical models whose properties are used to analyze empirical data. It works roughly like this: data are collected and compared with a known theoretical distribution. If there is a similarity, then the properties of the theoretical model are transferred, with a certain degree of confidence, to the empirical data, together with the corresponding conclusions. This approach underlies the classical methods of statistical hypothesis testing (calculation of confidence intervals, comparison of means, checking the significance of parameters, etc.).
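A minimal sketch of this idea, using the Kolmogorov-Smirnov test from scipy and simulated data as a stand-in for real measurements:

from scipy.stats import norm, kstest

# simulated data stands in for real measurements
data = norm.rvs(loc=0, scale=1, size=500)

# compare the sample with the theoretical N(0, 1) distribution
stat, p_value = kstest(data, 'norm')
print(p_value)    # a large p-value gives no evidence against the normal model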

If the available data do not correspond to any known theoretical distribution (which often happens in practice, though it is frequently ignored), then it is not recommended to use the chosen template (probabilistic-statistical model). The misuse of parametric distributions (those listed above) leads to a situation where the analyst searches for the keys not where they were lost, but under the lamppost where there is light. To solve the problem, there are other approaches based on non-parametric statistics.

