Many engineers have never worked in statistics or data science. Yet when they build data science pipelines, or rewrite code produced by data scientists into adequate, easily maintained code, many nuances and misunderstandings arise on the engineering side. I am writing this series of posts for those Data/ML engineers and for novice data scientists. I'll try to explain some basic approaches in plain English and, building on them, explain some of the basic concepts of Data Science.
The whole series:
- Data Science. Probability
- Data Science. Bayes theorem
- Data Science. Probability distributions
- Data Science. Measures
- Data Science. Correlation
- Data Science. The Central Limit Theorem and sampling
- Demystifying hypothesis testing
- Data types in DS
- Descriptive and Inferential Statistics
- Exploratory Data Analysis
Defining the type of variable you are working with is always the first step in the data analysis process. Later on, this makes it easy to determine which type of analysis is the most appropriate.
In its most general form, data can be divided into quantitative and qualitative.
Quantitative, as the name implies, is a data type where numbers have a mathematical value: they indicate a quantity, amount, or measurement of a characteristic.
With quantitative measures, numbers stand for themselves. No additional information is needed: 1.5 is 1.5, 5 is 5, 100 is 100. Quantitative variables are further divided into discrete and continuous.
A discrete scale is quantitative, but it does not cover every value in a range. Take the number of children in a family as an example: we may have 1 child, 3 children, 5 children, or even 10, but we cannot have 1.5 or 3.75. That is, the variable takes only separate, point-like values.
A continuous scale covers the whole range: it can take any value from -∞ to +∞, including fractional ones. For example, we can measure time in days, hours, seconds, milliseconds, and so on. A continuous scale is defined for every possible value.
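The difference between the two scales can be sketched in a few lines of Python. This is only an illustration: the variable names and sample values below are invented, and representing discrete data as `int` and continuous data as `float` is an assumption of the sketch, not a rule.

```python
from collections import Counter
from statistics import mean

# Hypothetical samples, invented purely for illustration.
children_per_family = [1, 3, 0, 2, 2, 5]        # discrete: whole counts only
task_duration_hours = [1.5, 0.25, 3.75, 2.0]    # continuous: any value is possible

# Discrete data can be tabulated value by value.
children_counts = Counter(children_per_family)

# A summary of discrete data may still be fractional, even though
# no single family has a fractional number of children.
average_children = mean(children_per_family)

# Continuous data can be re-expressed in finer units without losing meaning.
task_duration_minutes = [hours * 60 for hours in task_duration_hours]

print(children_counts)
print(average_children)
print(task_duration_minutes)
```

Note that the average number of children comes out fractional. That is a property of the summary, not of the scale: each individual observation is still a whole number.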
Qualitative variables reflect a property or quality of objects. Here numbers do not stand for themselves, as in the quantitative case; they denote some quality or property of the objects. In other words, they serve as labels for categories.
For example, let's say we compare people living in one state to people living in another. We can encode people from California as 1 and New Yorkers as 2. The one and the two mean nothing except that they denote these categories, which are the focus of our analysis.
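To see why such codes are only labels, here is a minimal sketch using the same encoding. The names `STATE_CODES` and `residents` and the sample itself are hypothetical. Arithmetic on the codes runs without error, but only counting per category gives an interpretable result.

```python
from collections import Counter
from statistics import mean

# Encoding from the example: the numbers are labels, not quantities.
STATE_CODES = {"California": 1, "New York": 2}
CODE_TO_STATE = {code: state for state, code in STATE_CODES.items()}

# Invented sample of residents.
residents = ["California", "New York", "California", "California"]
encoded = [STATE_CODES[state] for state in residents]

# Python happily averages the codes, but the result has no
# interpretation: there is no "state 1.25".
meaningless_average = mean(encoded)

# Counting how many objects fall into each category is meaningful.
residents_per_state = Counter(CODE_TO_STATE[code] for code in encoded)

print(meaningless_average)
print(residents_per_state)
```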
Qualitative variables are divided into nominal and ordinal types.
Let's take a closer look at what each of these types means, starting with nominal variables, the most basic and simplest scale. The only information a nominal variable contains is which class or group an object belongs to. These variables can only be measured in terms of membership in qualitatively distinct classes, and you cannot determine an order among these classes.
For example, we can study people from different states, or people with different eye colors: blue, green, brown. These are all nominal variables. It does not matter what color your eyes are; there is no order among these values.
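One possible way to model a nominal variable in Python is a plain `Enum` from the standard library. This is a sketch, not the only option; the class `EyeColor` and the sample are made up for illustration. A plain `Enum` supports equality checks and counting but refuses ordering comparisons, which matches the nominal scale.

```python
from collections import Counter
from enum import Enum


class EyeColor(Enum):
    """Hypothetical nominal variable: categories without any order."""
    BLUE = "blue"
    GREEN = "green"
    BROWN = "brown"


# Invented sample.
sample = [EyeColor.BROWN, EyeColor.BLUE, EyeColor.BROWN, EyeColor.GREEN]

# Equality and counting are meaningful for nominal data.
counts = Counter(sample)
most_common_color, _ = counts.most_common(1)[0]

# Ordering is not: a plain Enum raises TypeError on "<".
try:
    EyeColor.BLUE < EyeColor.GREEN
    ordering_supported = True
except TypeError:
    ordering_supported = False

print(most_common_color, ordering_supported)
```

The most frequent category (the mode) is a typical summary for nominal data, since it relies only on counting.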
Ordinal variables differ from nominal variables in that an order appears. Their values not only divide objects into classes or groups but also rank them in a certain way.
For example, we have grades at school: A, B, C, D, F. In this case we can say that a person who received an A was most likely better prepared for the test than a person who received an F. We cannot say by how much, but we can say for sure that an A is better than a D.
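The grades example can be sketched with an `IntEnum`, which adds ordering to the categories. The numeric ranks 0 to 4 below are an arbitrary assumption of this sketch: only their order carries meaning, so comparisons, sorting, and the median are reasonable, while differences between ranks are not.

```python
from enum import IntEnum
from statistics import median_low


class Grade(IntEnum):
    """Hypothetical ordinal variable: the ranks encode order only."""
    F = 0
    D = 1
    C = 2
    B = 3
    A = 4


# Invented sample of grades.
grades = [Grade.B, Grade.F, Grade.A, Grade.C, Grade.B]

# Comparisons are meaningful: an A is better than a D.
a_beats_d = Grade.A > Grade.D

# Sorting and order-based summaries such as the median also make sense.
ranked = sorted(grades, reverse=True)
middle_grade = median_low(grades)

# Grade.A - Grade.D would evaluate to 3, but that number does not
# measure "how much better" an A is, so it should not be interpreted.

print(a_beats_d, [g.name for g in ranked], middle_grade.name)
```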