Many engineers have never worked with statistics or data science. Yet when they build data science pipelines or rewrite data scientists' code into clean, maintainable code, many nuances and misunderstandings arise on the engineering side. This series of posts is for those Data/ML engineers and for novice data scientists. I'll try to explain some basic approaches in plain English and, building on them, some basic Data Science concepts.
The whole series:
- Data Science. Probability
- Data Science. Bayes theorem
- Data Science. Probability distributions
- Data Science. Measures
- Data Science. Correlation
- Data Science. The Central Limit Theorem and sampling
- Demystifying hypothesis testing
- Data types in DS
- Descriptive and Inferential Statistics
- Exploratory Data Analysis
Eager to impress business stakeholders quickly, data scientists often skip the process of getting to know the data entirely. This is a serious and unfortunately common mistake that may lead to the following problems:
- lost insights and, as a result, poor project outcomes;
- inaccurate models due to unclear data structure, outliers, or skew in the data;
- models that look accurate but are built on incorrect data;
- the wrong attributes selected for the model;
- inefficient use of resources.
The purpose of Exploratory Data Analysis is to get acquainted with the data: to understand its structure, to check for missing values and anomalies, to form hypotheses about the population, to define and refine the choice of features that will be used for machine learning, and so on.
You may be surprised (sarcasm), but sometimes you don't need machine learning at all, and a simple heuristic can easily beat all the models. To find that heuristic you also need to know the data.
This analysis is valuable because it lets you be more confident that future results will be reliable, correctly interpreted, and applied in the desired business context.
Most EDA methods are graphical in nature. The reason for the strong reliance on graphics is that EDA's primary role is open-ended exploration, and graphics combined with the natural pattern recognition capabilities that we all have give analysts unprecedented power to understand the data.
There is an infinite number of possible charts and graphs, but you only need a handful of them to know the data well enough to work with it.
Who are the billionaires?
Let's try to illustrate the EDA methods on the billionaires' dataset. It's always interesting to count other people's money, isn't it?
EDA tools vary. We will use the following together with Python:
- Jupyter notebook for live coding and visualizations
- Standard tools for anyone working with data: pandas, numpy
- Visualization tools: matplotlib, seaborn, plotly
Start with basics
Start by answering simple questions:
- How many entries do we have in the dataset?
- How many columns/features do we have?
- What types of data do we have in our features?
- Do all columns in the dataset make sense?
- Do we have a target variable?
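    import pandas as pd

    df = pd.read_csv('billionaires.csv')
    print(df.shape)    # (number of rows, number of columns)
    print(df.columns)  # feature names
    df.head()          # first five rows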
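The questions about data types and about whether every column makes sense can also be answered with a small per-column summary. The following is a minimal sketch, not part of the original notebook: the `overview` helper is hypothetical, and the tiny DataFrame is made-up data that only borrows the column names from the billionaires dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data (assumption: only the column names are real).
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "demographics.age": [54, -47, 0, 71],
    "company.founded": [1975.0, np.nan, 1999.0, 1962.0],
    "wealth.how.industry": ["Consumer", "0", "Technology-Computer", None],
})


def overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of missing values, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


summary = overview(df)
print(f"{df.shape[0]} rows, {df.shape[1]} columns")
print(summary)
```

Columns with many missing values or with a single distinct value are usually the first candidates for a closer look.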
The purpose of displaying examples from the data set is not to make a thorough analysis. It is to get a qualitative "sense" of the data we have.
The goal of descriptive statistics is to have a generalized view of your data so that you can begin to query and visualize your data in different ways.
describe() in pandas is a convenient way to obtain summary statistics: it returns the count, mean, standard deviation, minimum and maximum values, and the quartiles of each numeric column.
- There is a notably large difference between the 75th percentile and the maximum of wealth.
- We can immediately see that the data covers 1996 to 2014, and 2014 is also the median, that is, we have a lot of data specifically for 2014.
- The billionaires' age obviously has some strange values, such as -47 years.
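The kind of checks listed above can be read straight off the describe() table. Here is a small sketch on made-up numbers (the values are assumptions chosen to mimic the patterns described, not rows from the real dataset):

```python
import pandas as pd

# Made-up rows that imitate the patterns discussed above.
df = pd.DataFrame({
    "year": [1996, 2001, 2014, 2014, 2014],
    "demographics.age": [54, -47, 0, 71, 63],
    "wealth.worth in billions": [1.0, 1.2, 1.9, 3.5, 76.0],
})

stats = df.describe()
print(stats)

# Distance between the 75th percentile and the maximum, per column:
# a large gap hints at a long right tail or at outliers.
gap = stats.loc["max"] - stats.loc["75%"]
print(gap)

# Impossible values show up in the "min" row.
print(stats.loc["min", "demographics.age"])
```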
At this stage, you should start taking notes about potential fixes you would like to make. If something looks out of place, such as a potential deviation in one of your features, now would be a good time to ask the client/key stakeholder, or dig a little deeper.
We've had a first look at the data. Let's now explore it with some plotting.
Plot quantitative data
Often a quick histogram is enough to understand the data.
Let's start with the interesting thing — how much money does anyone have?
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(15, 10))
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(df['wealth.worth in billions'], kde=True, stat='density')
    plt.xscale('log')

I used a logarithmic scale to make the distribution visible at all. Clearly, most people do not have huge amounts of money, but there is also a long tail of people who have a great deal of it.
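To see why the logarithmic scale helps, compare how a heavy-tailed sample falls into equally wide bins versus logarithmically spaced bins. This is only an illustrative sketch: the sample is synthetic (a Pareto-like draw chosen as an assumption), not the wealth column itself.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic heavy-tailed "wealth" values, all >= 1 (assumption, not real data).
wealth = 1 + rng.pareto(a=1.5, size=1000)

# Equal-width bins: almost everything lands in the first bin.
linear_counts, _ = np.histogram(wealth, bins=10)

# Logarithmically spaced bins between the smallest and largest value.
log_bins = np.geomspace(wealth.min(), wealth.max(), num=11)
log_counts, _ = np.histogram(wealth, bins=log_bins)

print("linear bins:", linear_counts)
print("log bins:   ", log_counts)
```

With equal-width bins the long tail stretches the axis so much that the bulk of the data collapses into one bar; log-spaced bins spread the same observations across several bars.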
Let's move on. How old are our billionaires?
Remember that this column contains outliers. Let's clean them up and look at the correct picture.

    df = df[df['demographics.age'] > 0]  # drop the records with incorrect age
    plt.figure(figsize=(15, 10))
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(df['demographics.age'], bins=15, kde=True, stat='density')
    plt.show()

The distribution is similar to normal, with a slightly larger tail on the left.
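A visual impression of a longer left tail can be backed up with a number. The sketch below uses the sample skewness from pandas on a handful of made-up ages (an assumption for illustration); a negative value points to a longer left tail, and so does a mean that sits below the median.

```python
import pandas as pd

# Made-up ages with one young outlier pulling the left tail (not real data).
ages = pd.Series([30, 55, 60, 62, 65, 66, 68, 70, 72, 75])

skewness = ages.skew()  # sample skewness; negative means a longer left tail
print("skewness:", skewness)
print("mean:", ages.mean(), "median:", ages.median())
```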
Let's do the same, split by industry.

    # FacetGrid creates its own figure, so no plt.figure() call is needed
    g = sns.FacetGrid(data=df, hue='wealth.how.industry', aspect=3, height=4)
    g.map(sns.kdeplot, 'demographics.age', fill=True)  # fill replaces the deprecated shade
    g.add_legend(title='wealth.how.industry')

    # The same plot restricted to three industries
    industries = ['Hedge funds', 'Consumer', 'Technology-Computer']
    subset = df[df['wealth.how.industry'].isin(industries)]
    g = sns.FacetGrid(data=subset, hue='wealth.how.industry', aspect=3, height=4)
    g.map(sns.kdeplot, 'demographics.age', fill=True)
    g.add_legend(title='wealth.how.industry')

You can see that the money goes to the older part of the dataset. In addition, tech companies are skewed towards the young, while the consumer industry is skewed towards the elderly. There is also an industry where, for some reason, one can get rich before 20.
Plot qualitative data
Histograms are not suited to categorical features. Instead, you can use bar plots.
Let's answer the question: which industries have the most billionaires?

    # Count billionaires per industry, sorted from most to least common
    industry_counts = (
        df['wealth.how.industry']
        .value_counts()
        .rename_axis('wealth.how.industry')
        .reset_index(name='count')
    )
    plt.figure(figsize=(15, 8))
    sns.barplot(data=industry_counts, x='count', y='wealth.how.industry')
    plt.title('Industries of billionaires', fontsize=17)

Judging by the plot, the industries at the top are those that target consumers. It is difficult for me to draw any conclusions as to why, but this is an insight I can bring to the business. Besides, there is an industry labelled "0"; we can assume that these are people who either have no industry or have a mixed one.
Are there more men or women among the billionaires?

    plt.figure(figsize=(7, 5))
    sns.countplot(data=df, x='demographics.gender')
    plt.title('Gender', fontsize=17)

It just so happens that it's mostly men.
Let's look at the countries the billionaires come from.

    import plotly.graph_objects as go

    column = 'location.citizenship'
    counts = df[column].value_counts()  # computed once, reused for values and labels
    fig = go.Figure(
        data=[
            go.Pie(
                values=counts.values.tolist(),
                labels=counts.index.tolist(),
                name=column,
                marker=dict(line=dict(width=2, color='rgb(243,243,243)')),
                hole=0.3,
            )
        ],
        layout=dict(title=dict(text='Billionaire countries')),
    )
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.show()

More than a third of billionaires come from the United States.
As you can see, some industries, as well as one of the genders, have little data. These are rare classes. They are usually problematic when building models: they can lead to class imbalance (depending on what the target is) and therefore to an overfit model.
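One common way to handle rare classes, offered here only as a sketch and not as the approach used in the original notebook, is to merge every category below a chosen share into a single bucket. The `group_rare` helper, the `min_share` threshold, and the "Other" label are all assumptions made for illustration.

```python
import pandas as pd


def group_rare(series: pd.Series, min_share: float = 0.05, other: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below min_share with `other`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rest with the bucket label.
    return series.where(~series.isin(rare), other)


# Made-up industry column: two dominant classes and two rare ones.
industry = pd.Series(
    ["Consumer"] * 10 + ["Technology-Computer"] * 8 + ["Hedge funds"] + ["Mining"]
)
grouped = group_rare(industry, min_share=0.10)
print(grouped.value_counts())
```

Whether merging is appropriate depends on the target: if a rare class is exactly what the model should predict, merging it away would defeat the purpose.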
The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary:
- Minimum
- First quartile
- Median
- Third quartile
- Maximum
The information above would be almost as easy to assess in a non-graphic format. A boxplot is useful because in practice all of it, and much more, can be seen at a glance.
One can notice the symmetry of the distribution and possible signs of "fat tails". You can estimate the symmetry by checking whether the median is in the center of the box and whether the whiskers are of equal length. In a skewed distribution, the median will be pushed in the direction of the shorter whisker. Box plots also help to find outliers, the data points that stand out the most from the others.
Let's go through all the quantitative data and build their box plots.
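One way to do this in a single pass is sketched below. It is not the notebook's original code: the `boxplot_grid` helper is hypothetical, it uses plain matplotlib rather than seaborn, and the demo DataFrame is randomly generated data that only reuses some of the dataset's column names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def boxplot_grid(frame: pd.DataFrame, ncols: int = 3):
    """Draw one box plot per numeric column and return the figure."""
    numeric = frame.select_dtypes(include="number")
    n = numeric.shape[1]
    nrows = max(1, -(-n // ncols))  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows), squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, numeric.columns):
        ax.boxplot(numeric[col].dropna().to_numpy())  # missing values are skipped
        ax.set_title(col)
    for ax in flat[n:]:
        ax.set_visible(False)  # hide unused cells of the grid
    fig.tight_layout()
    return fig


# Randomly generated demo data (assumption: only the column names are real).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "rank": rng.integers(1, 500, size=200),
    "year": rng.choice([1996, 2001, 2014], size=200, p=[0.2, 0.2, 0.6]),
    "demographics.age": rng.normal(62, 12, size=200).round(),
    "wealth.worth in billions": 1 + rng.pareto(1.5, size=200),
    "demographics.gender": rng.choice(["male", "female"], size=200),
})
fig = boxplot_grid(demo)
```

Non-numeric columns such as the gender column are ignored by `select_dtypes`, so the helper can be pointed at the whole DataFrame.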
rank. It appears to show a person's rank in the overall sample.
year. We can see the period of time over which the billionaires were collected. It is heavily skewed towards recent years, which seems logical: if you earned your first billion a long time ago, over time you have probably accumulated more and are unlikely to leave this list.
company.founded. A similar conclusion. You can also see that there are likely to be some missing values, which we'll have to deal with later.
demographics.age. There are a lot of outliers: some people have zero or negative age, which is not right. If you throw away such outliers, you may suspect that the distribution of this variable is close to normal. We should build a distribution plot for this variable.
location.gdp. It is difficult to say much from this graph. It seems that most billionaire countries are not very rich, but it is hard to judge what exactly this column means.
wealth.worth in billions. A huge number of outliers, although judging by the quartiles most people have a number of billions close to zero, which we have already seen in the previous plots.
In the simplest box plot, the central rectangle spans the first quartile to the third quartile (the interquartile range, or IQR). A common rule treats points more than 1.5×IQR below the first quartile or above the third quartile as outliers (this is the default whisker length in matplotlib and seaborn), and points more than 3×IQR beyond the quartiles as extreme outliers. The right definition of an outlier will still differ from data set to data set.
Boxplots are very good at presenting information about central tendency, symmetry, and variance, although they can be misleading about aspects such as multimodality. One of the best applications of the boxplot is the side-by-side boxplot (see multivariate graphical analysis below).
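The IQR rule for flagging outliers is easy to apply directly. The sketch below is an illustration under stated assumptions: `iqr_outliers` is a hypothetical helper, the multiplier `k` is left as a parameter because the right threshold depends on the data, and the wealth values are made up.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values lying more than k*IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Made-up wealth values in billions (not real data).
wealth = pd.Series([1.0, 1.1, 1.2, 1.5, 1.9, 2.4, 3.5, 8.0, 76.0])

mild = wealth[iqr_outliers(wealth, k=1.5)]
extreme = wealth[iqr_outliers(wealth, k=3.0)]
print("k=1.5:", mild.tolist())
print("k=3.0:", extreme.tolist())
```

A larger `k` flags fewer points, so the same value can count as an outlier under one threshold and not under another.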
We've already talked about what correlation is and why it's needed. In short, correlations allow you to look at the relationship between numerical features. To recap, the correlation takes values from -1 to 1. A negative correlation means that as one feature increases, the other decreases; a positive correlation means that as one feature increases, the other increases too. A correlation of 0 means there is no linear relationship.
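As a quick refresher, the Pearson coefficient can be computed by hand and compared with what pandas returns. This is a minimal sketch on made-up numbers; the `pearson` function is written here for illustration and assumes the default (Pearson) method of `Series.corr`.

```python
import numpy as np
import pandas as pd


def pearson(x, y) -> float:
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


x = [1, 2, 3, 4, 5]
increasing = pearson(x, [2, 4, 6, 8, 10])   # y grows with x
decreasing = pearson(x, [10, 8, 6, 4, 2])   # y falls as x grows
noisy = pearson(x, [2, 1, 4, 3, 5])         # positive, but not perfect

print(increasing, decreasing, noisy)
print(pd.Series(x).corr(pd.Series([2, 1, 4, 3, 5])))  # pandas agrees with the manual value
```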
Correlation research is very useful when looking at data, as it gives an idea of how the columns relate to each other.
Correlation heatmaps help you visualize this information.
    cols = ['rank', 'year', 'company.founded', 'demographics.age', 'location.gdp']
    sns.set(font_scale=1.25)
    sns.heatmap(df[cols].corr(), annot=True, fmt='.1f')
    plt.show()

In general, you should look out for:
- Which features are strongly correlated with the target variable?
- Are there interesting or unexpected strong correlations between other features?
I don't see any interesting insights in my data.
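When there are many columns, scanning a heatmap by eye gets tedious, and it can help to list the strongest pairs explicitly. The following is a sketch under assumptions: `top_correlations` is a hypothetical helper, and the DataFrame is made-up data rather than the billionaires dataset.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Made-up data: b is a linear function of a, c falls as a grows, d alternates.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [3, 5, 7, 9, 11, 13],
    "c": [6, 5, 4, 3, 2, 1.5],
    "d": [1, -1, 1, -1, 1, -1],
})
strongest = top_correlations(demo, n=3)
print(strongest)
```

Sorting by absolute value keeps strong negative relationships in the list alongside strong positive ones.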
Don't be afraid to try more
Clearly, the simple visualizations described here cannot always show every aspect of the data or answer every question. So don't be afraid to experiment and try other approaches.
The full notebook can be found here
By the end of your Exploratory Data Analysis step, you'll have a pretty good understanding of the dataset, some notes for data cleaning, and possibly some ideas for feature engineering.