Data Science. Descriptive and Inferential Statistics

#data-engineering #data-science #programming

Many engineers haven't had direct exposure to statistics or Data Science. Yet, when building data pipelines or translating Data Scientist prototypes into robust, maintainable code, engineering complexities often arise. For Data/ML engineers and new Data Scientists, I've put together this series of posts.

I'll explain core Data Science approaches in simple terms, building from basic concepts to more complex ones.

"Facts are stubborn things, but statistics are pliable"
― Mark Twain

The job of a data analyst is not to produce a lot of fancy reports containing tons of data, as it may first seem. The analyst needs to understand what the data can tell the business and how it can help solve existing problems.

Okay, we have data. What's next?

The next step is to get insights from it.

What are insights?

Insights are valuable knowledge obtained with the help of data analytics. It's a very general definition because a lot of things can fall under the category of insights, from finding the most wasteful category in your budget to understanding which movie will make more money for the cinema. Insights can apply to the acquired data samples or to the whole population this data came from.

Unfortunately, in most cases it is not possible to understand something simply by looking at the data. There is a lot of it, and raw data is not the most convenient object for making assumptions. The data should be characterized by a set of easily interpreted attributes. This is what descriptive statistics do.

Descriptive statistics look for general characteristics and patterns, and describe the collected real-world data samples. We have already considered many of them when talking about measures and correlations.

The definition of descriptive statistics looks somewhat blurred in this respect. In one case it is enough to limit yourself to basic characteristics like the mean and the variation; in others it is necessary to dive deeper into more specific methods to understand the data. The variety of approaches is primarily dictated by the problem-specific features of the data, but in general it is possible to offer a common set of characteristics and approaches that make up the framework of descriptive statistics.

Approaches of descriptive statistics can be distinguished by the number of attributes (one-dimensional, multidimensional) as well as by the type of data (quantitative and qualitative).

We will look at the following variants: a one-dimensional task with numerical data, a one-dimensional task with categorical (non-numerical) data, and a one-dimensional task with mixed data.

Univariate Analysis

There is a data sample with n observations of just one attribute. We will consider the following set of characteristics as the basic descriptive statistics for such a one-dimensional data array:

mean value
maximum and minimum values
standard deviation
coefficient of variation
quartiles: 25%, 50%, 75%
mode
kurtosis
kernel density estimation

The mean, minimum, maximum, and standard deviation should be clear, so let's focus on the rest.

The coefficient of variation is the ratio of the standard deviation to the mean. It is a dimensionless value characterizing the variation of the data sample. As a rule, it is expressed as a percentage by multiplying it by 100. It shows how large the standard deviation, the absolute (dimensional) measure of variation, is relative to the mean. When the mean is zero, the coefficient is not defined. It is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.
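As a minimal sketch of that comparison, the snippet below computes the coefficient of variation for two made-up series on very different scales. The numbers and the helper name coefficient_of_variation are illustrative assumptions, not part of any dataset used in this post.

```python
import statistics


def coefficient_of_variation(values):
    """Return std / mean; raise when the mean is zero (the ratio is undefined)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Population standard deviation (divides by n, not n - 1).
    return statistics.pstdev(values) / mean


# Two hypothetical series measured in different units.
heights_cm = [150, 160, 170, 180, 190]
salaries = [20_000, 30_000, 40_000, 50_000, 60_000]

cv_heights = coefficient_of_variation(heights_cm)
cv_salaries = coefficient_of_variation(salaries)

print(f"heights:  std={statistics.pstdev(heights_cm):.1f}, CV={cv_heights:.1%}")
print(f"salaries: std={statistics.pstdev(salaries):.1f}, CV={cv_salaries:.1%}")
```

The raw standard deviations cannot be compared directly because the units differ, while the two ratios can.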

Quartiles (a special case of percentiles, or quantiles) are values that divide the ordered sample into four groups containing an approximately equal number of observations. The cut points are the 25th, 50th, and 75th percentiles; the 100th percentile is the maximum.

For example, the median (the 50th percentile, or 2nd quartile) of the set [1, 2, 3, 4] is 2.5. Here the position in the ordered set is also fractional, $2.5 = \frac{1+4}{100} \cdot 50$, so the value is interpolated between the 2nd and 3rd elements.

Quartiles provide information on where the bulk of the data lies, how big the spread is, whether the dataset is skewed toward one side, and, when compared with the minimum and maximum, whether outliers are present.
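To illustrate how quartiles hint at spread and outliers, here is a small sketch using only the standard library. The sample values are invented, and the 1.5 * IQR fence is a common rule of thumb (Tukey's rule) that I am assuming here, not something prescribed earlier in this post.

```python
import statistics

# Hypothetical sample with one suspiciously large value.
data = [2, 3, 4, 5, 5, 6, 7, 8, 9, 40]

# method="inclusive" interpolates linearly between neighbouring ordered values.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: ({low_fence}, {high_fence}), outliers: {outliers}")

# The [1, 2, 3, 4] example from the text: the median falls between 2 and 3.
print(statistics.median([1, 2, 3, 4]))
```

The gap between the median and Q3 being larger than the gap between Q1 and the median is the kind of asymmetry that suggests skew toward the right.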

Kurtosis

Kurtosis is a quantitative characteristic of how sharply the distribution peaks near its mathematical expectation, and how heavy its tails are, in relation to the normal distribution.

With positive excess kurtosis, the distribution has a "bell" with a higher, sharper peak near the mean and heavier tails than the normal distribution. With negative excess kurtosis, the distribution is flatter and has lighter tails. The normal distribution corresponds to a value of 3 if Pearson kurtosis is used and to 0 for Fisher (excess) kurtosis.

The kurtosis statistic is very dependent on the sample size. Small samples can give very misleading results.
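The relation between the two scales can be sketched with a hand-written moment-based estimator. The function below is my own illustrative helper (it uses the biased moments m4 / m2^2, which is also what scipy.stats.kurtosis computes with its default settings), and the two tiny samples are made up.

```python
def kurtosis(values, fisher=True):
    """Moment-based kurtosis m4 / m2**2; subtract 3 for the Fisher (excess) scale."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    pearson = m4 / m2 ** 2
    return pearson - 3 if fisher else pearson


flat = [1, 2, 3, 4, 5]                # evenly spread, no pronounced peak
peaked = [1, 3, 3, 3, 3, 3, 3, 3, 5]  # most values at the centre, two far away

for name, sample in (("flat", flat), ("peaked", peaked)):
    print(
        f"{name}: Pearson={kurtosis(sample, fisher=False):.2f}, "
        f"Fisher={kurtosis(sample, fisher=True):.2f}"
    )
```

The flat sample comes out negative on the Fisher scale and the peaked one positive. With samples this small the numbers are only illustrative, which is exactly the sample-size caveat above.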

Kernel density estimation gives an overview of the data distribution. It is in some way a generalization of the histogram concept. A histogram is a piecewise-constant estimate of the probability density; a kernel estimate is a smooth "generalization" of the histogram and gives a more adequate idea of the distribution density.
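To show what "smooth generalization" means in practice, here is a bare-bones Gaussian kernel estimator written from scratch. The helper name make_kde, the sample, and the hand-picked bandwidth of 0.4 are all assumptions for illustration; libraries such as SciPy choose the bandwidth automatically.

```python
import math


def make_kde(sample, bandwidth):
    """Return a density function built as the average of Gaussian bumps."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # One Gaussian bump centred on every observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical sample with two clusters.
sample = [1.0, 1.2, 1.4, 1.5, 4.8, 5.0, 5.1, 5.3]
density = make_kde(sample, bandwidth=0.4)

# Evaluate on a grid from -3 to 10 and check the estimate integrates to about 1.
step = 0.01
grid = [i * step for i in range(-300, 1001)]
area = sum(density(x) for x in grid) * step

print(f"density at 1.3: {density(1.3):.3f}")
print(f"density at 3.0: {density(3.0):.3f}")
print(f"area under the curve: {area:.3f}")
```

The estimate is high inside each cluster and close to zero between them, without the bin-edge jumps a histogram would show.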

Let's consider a sample of 500 values generated from two distributions that are shifted relative to each other and have different spreads. Note that the snippet below draws the components with np.random.rand, so they are uniform on [0, 1.5) and [2, 3.2) rather than normal (np.random.randn would draw normal values). Since the components are shifted relative to each other, the resulting distribution is bimodal, i.e. the estimated density will have two local maxima.

import numpy as np
import plotly.figure_factory as ff

sample = np.hstack((1.5 * np.random.rand(200).ravel(), 2 + 1.2 * np.random.rand(300)))

fig = ff.create_distplot([sample], ['distplot'])
fig.show()

As a result, the code that performs the calculation of basic statistical parameters together with the test sample will look like this:

import typing as T

import numpy as np
import scipy.stats as st
import plotly.figure_factory as ff


def calc_desc1d(sample: T.Sequence) -> T.Tuple:
    result = []
    result.append(len(sample))  # number of observations
    result.append(np.mean(sample))  # mean
    result.append((np.min(sample), np.max(sample)))  # min and max
    result.append(np.std(sample))  # standard deviation
    result.append(result[3] / result[1])  # coefficient of variation: std / mean
    result.append((np.percentile(sample, 25), np.percentile(sample, 50), np.percentile(sample, 75)))  # quartiles
    result.append(st.mode(sample))  # most frequent value
    result.append(st.kurtosis(sample))  # Fisher (excess) kurtosis
    # kernel density estimate evaluated on a 100-point grid
    _range = np.linspace(0.9 * np.min(sample), 1.1 * np.max(sample), 100)
    result.append(st.gaussian_kde(sample)(_range))
    return tuple(result)

def descriptive1d_quantative(sample: T.Sequence) -> None:
    n, m, minmax, s, cv, perct, mode, kurt, kde = calc_desc1d(sample)
    print(f"Number of elements in sample: {n}")
    print(f"Mean: {m}")
    print(f"Min, max: {minmax}")
    print(f"Std: {s}")
    print(f"Coefficient of variation: {cv:.4f}")
    print(f"Mode: {mode[0]}")
    print(f"Quartiles: \n\t(25%) = {perct[0]}, \n\t(50%) = {perct[1]}, \n\t(75%) = {perct[2]}")
    print(f"Kurtosis: {kurt}")

    fig = ff.create_distplot([sample], ['distplot'])
    fig.show()

Applied to the test sample, it prints approximately the following:

Number of elements in sample: 500
Mean: 1.8325597597096608
Min, max: (0.014421103192962192, 3.199713640415771)
Std: 1.0185414748468167
Coefficient of variation: 0.5558
Mode: [0.0144211]
Quartiles: 
	(25%) = 0.7811891194331109, 
	(50%) = 2.169811866965774, 
	(75%) = 2.7310514192500137
Kurtosis: -1.3475589255969815

Let's try some real data. I took a video game sales dataset.

import pandas as pd
df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
column = 'Critic_Score'
descriptive1d_quantative(df[column].astype(float).dropna())

Number of elements in sample: 8137
Mean: 68.96767850559173
Min, max: (13.0, 98.0)
Std: 13.937308058262083
Coefficient of variation: 0.2021
Mode: [70.]
Quartiles: 
	(25%) = 60.0, 
	(50%) = 71.0, 
	(75%) = 79.0
Kurtosis: 0.1420259847017955

The most frequent score is 70, which is closer to the maximum of 98 than to the minimum of 13. The median (the 50% quartile, 71) and the mean (about 69) are close to each other, which suggests that the distribution is roughly symmetrical. The quartiles (60, 71, 79) sit fairly close together, while the minimum lies much farther from the mean than the maximum does, so the distribution has a central peak with a longer tail on the left side. Judging by the kurtosis (about 0.14 on the Fisher scale, where the normal distribution gives 0), the distribution looks close to normal.

Description of non-numerical data

In the case of non-numerical data, possible operations are much more limited. We cannot calculate the mean and standard deviation; practically all statistical characteristics used to describe quantitative data become unavailable. What we can do to describe non-numerical data is:

Determining the number of unique items.
Determining the frequency of those items
Determining the most frequently occurring item (mode of distribution)
Determining the rarest item
Quantitative characterization of heterogeneity in the frequency of the elements (for example, Shannon entropy)
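To make the last item concrete, here is a minimal sketch of Shannon entropy computed directly from item counts. The helper name shannon_entropy and the two toy samples are assumptions for illustration. Entropy is measured in nats (natural logarithm).

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Entropy in nats: -sum(p * ln p) over the observed item frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


balanced = ["a", "b", "c", "d"] * 25   # every item equally frequent
skewed = ["a"] * 97 + ["b", "c", "d"]  # one item dominates

max_entropy = math.log(4)  # upper bound for four distinct items

print(f"balanced: {shannon_entropy(balanced):.3f} of {max_entropy:.3f}")
print(f"skewed:   {shannon_entropy(skewed):.3f} of {max_entropy:.3f}")
```

The closer the entropy is to its upper bound, the more evenly the items are spread; a value near zero means a single item dominates.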

Let's implement these five indicators (and a couple more) as a Python function:

from collections import Counter

import plotly.graph_objects as go


def calc_desc1d_qualitative(sample: T.Sequence) -> T.Tuple:
    count = Counter(sample)  # occurrences of each unique item
    ranked = count.most_common()  # items sorted from most to least frequent
    result = []
    result.append(len(sample))  # number of items in the sample
    result.append(dict(count))  # count per unique item
    result.append({k: v / float(result[0]) for k, v in count.items()})  # frequency of items
    # most frequent item(s), ties included
    result.append({k: v for k, v in ranked if v == ranked[0][1]})
    # rarest item(s), ties included
    result.append({k: v for k, v in ranked if v == ranked[-1][1]})
    result.append(st.entropy(list(count.values())))  # Shannon entropy
    result.append(np.log(len(count)))  # max entropy: all unique items equally frequent
    return tuple(result)

def descriptive1d_qualitative(sample: T.Sequence) -> None:
    # sample is expected to be a pandas Series (value_counts is used for the chart)
    n, uniques, freq, most_common, rare, entropy, maxentropy = calc_desc1d_qualitative(sample)
    print(f'Number of elements in sample {n}')
    print(f'Number of elements by type: {uniques}')
    print(f'Item frequencies: {freq}')
    print(f'Most frequent item: {most_common}')
    print(f'Most rare item: {rare}')
    print(f'Shannon entropy: {entropy}')
    print(f'Max entropy: {maxentropy:.4f}')

    data = sample.value_counts()
    fig = go.Figure(data=[
        go.Pie(labels=data.keys(), values=data.values, hole=.3, textinfo='label+percent')
    ])
    fig.update_layout(
        title_text='Popular game genre', 
        annotations=[dict(text='Genre', font_size=30, showarrow=False)]
    )
    fig.show()

In the end, we will get something similar:

Number of elements in sample 16719
Number of elements by type: {'Sports': 2348, 'Platform': 888, 'Racing': 1249, 'Role-Playing': 1500, 'Puzzle': 580, 'Misc': 1750, 'Shooter': 1323, 'Simulation': 874, 'Action': 3370, 'Fighting': 849, 'Adventure': 1303, 'Strategy': 683, nan: 2}
Item frequencies: {'Sports': 0.14043902147257611, 'Platform': 0.05311322447514803, 'Racing': 0.07470542496560799, 'Role-Playing': 0.0897182845863987, 'Puzzle': 0.034691070040074164, 'Misc': 0.10467133201746516, 'Shooter': 0.07913152700520366, 'Simulation': 0.05227585381900832, 'Action': 0.20156707937077575, 'Fighting': 0.05078054907590167, 'Adventure': 0.07793528321071834, 'Strategy': 0.04085172558167355, nan: 0.00011962437944853162}
Most frequent item: {'Action': 3370}
Most rare item: {nan: 2}
Shannon entropy: 2.354323487666032
Max entropy: 2.5649

Mixed data

The case of mixed data is largely determined by the nature of the problem. Non-numerical values can be filtered out of the array and separated from the numerical ones. Thus, the task is reduced to analyzing a numeric array and a non-numeric array separately. This approach doesn't take into account the specifics of the data processing task, but it is quite applicable if you need a general characterization of the data.
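One simple way to do that split is sketched below, assuming that "numeric" means "anything float() can parse". The helper name split_mixed and the sample column are hypothetical.

```python
def split_mixed(values):
    """Split a mixed list into parsed numeric values and everything else."""
    numeric, non_numeric = [], []
    for value in values:
        try:
            numeric.append(float(value))
        except (TypeError, ValueError):
            # Not parseable as a number: keep it as a categorical item.
            non_numeric.append(value)
    return numeric, non_numeric


# Hypothetical column where scores and text labels are mixed together.
mixed = [71, "83", "tbd", 64.5, None, "n/a", "90"]
numeric, non_numeric = split_mixed(mixed)

print(numeric)
print(non_numeric)
```

Each part can then be summarized with the numeric and non-numeric descriptions from the previous sections.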

Inferential Statistics

Descriptive statistics provide information about the data sample we have already obtained. For example, we could calculate the average and standard deviation of the SAT scores of 50 students, and this would give us information about this group of 50 students. But very often you are not interested in a specific group of students, but in students in general, for example all students in the US. Any set of data that includes all the data you are interested in is called a population.

Very often you do not have access to the entire population you are interested in, but only to a small sample. For example, you may be interested in exam scores for all students in the US. It is not possible to collect all the exam scores of all students in all states, so you will have to measure a smaller sample of students (e.g. 100 students), which is used to represent the larger population of all US students. However, it is important that the data sample accurately reflects the overall population. The process of achieving this is called sampling (sampling strategies are described in detail in this post).

Inferential statistics allow us to draw conclusions about the population from sample data, conclusions that might not be immediately obvious. Inferential statistics is needed because sampling naturally introduces sampling error, so a sample is not expected to reflect the population perfectly. The main methods of inferential statistics are:
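As a rough illustration of why a sample only approximates the population, the sketch below simulates a population of exam scores and compares sample means with the population mean. The population itself (100,000 normally distributed scores with mean 70 and standard deviation 12) is entirely made up for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population of exam scores (simulated, not real data).
population = [random.gauss(70, 12) for _ in range(100_000)]
population_mean = statistics.fmean(population)

for size in (10, 100, 1_000, 10_000):
    sample = random.sample(population, size)  # simple random sampling
    error = statistics.fmean(sample) - population_mean
    print(f"n={size:>6}: sample mean error = {error:+.3f}")
```

The difference between a sample mean and the population mean tends to shrink as the sample grows, but it rarely disappears completely.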

Parameter Estimation
Hypothesis Testing
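Both methods can be sketched in a few lines. The example below uses a simulated sample and a normal (z) approximation, which is an assumption that is reasonable for around 100 observations; for small samples a t-distribution is the usual choice (for instance scipy.stats.ttest_1samp). The hypothesized mean of 70 is arbitrary.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(0)

# Hypothetical sample of 100 exam scores (simulated, not real data).
sample = [random.gauss(72, 10) for _ in range(100)]
n = len(sample)
mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Parameter estimation: approximate 95% confidence interval for the population mean.
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = mean - z * sem, mean + z * sem

# Hypothesis testing: H0 says the population mean equals 70 (two-sided z-test).
z_stat = (mean - 70) / sem
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

The two views agree by construction: the p-value drops below 0.05 exactly when 70 falls outside the 95% interval.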

Conclusion

Descriptive statistics help you understand your data and are an initial and very important step in data science. This is because data science makes predictions, and you cannot predict if you cannot understand the patterns in available data.

As you have seen, descriptive statistics are only used to describe some of the main properties of the data in a study. They provide a simple summary of the sample and allow us to view the data in a meaningful way. This makes it easier to interpret the data correctly in later stages, for example in EDA.