Many engineers haven't had direct exposure to statistics or Data Science. Yet, when building data pipelines or translating Data Scientist prototypes into robust, maintainable code, engineering complexities often arise. For Data/ML engineers and new Data Scientists, I've put together this series of posts.
I'll explain core Data Science approaches in simple terms, building from basic concepts to more complex ones.
The whole series:
- Data Science. Probability
- Data Science. Bayes theorem
- Data Science. Probability Distributions
- Data Science. Measures
- Data Science. Correlation
- Data Science. The Central Limit Theorem and Sampling
- Data Science. Demystifying Hypothesis Testing
- Data Science. Data Types
- Data Science. Descriptive and Inferential Statistics
- Data Science. Exploratory Data Analysis
Why do we care so much about data types? Here's the thing: if you misclassify a continuous variable as categorical (or vice versa), you're setting yourself up for a world of trouble. Imagine analyzing thousands of rows of data only to realize halfway through that you've got the wrong type for one of your core variables. Suddenly, everything from your storage strategy to your analysis is compromised, and you're redoing work.
For Data Engineers and Data Scientists alike, knowing your data types isn't just about getting numbers in the right places; it's about building solid, reliable pipelines and models that won't break down because of avoidable mistakes. Plus, data types affect how efficiently data can be stored, how fast it's processed, and how accurate the final insights are. So, yeah, it's kind of a big deal.
At the highest level, data splits into quantitative and qualitative types.
Quantitative Data
Quantitative data consists of numbers that express a quantity, amount, or measurement. Here, numbers mean exactly what they say: 1.5 is 1.5, 5 is 5, and so forth, with no extra interpretation needed.
- Discrete data consists of distinct, separate values. Think of the number of children in a family: you can have 1, 3, 5, or even 10, but not 1.5 or 3.75. These are fixed, point-like values.
- Continuous data can take any value within a range. Time, for example, can be measured in days, hours, seconds, milliseconds, and beyond. A continuous scale allows fractional values at any precision. Its range may be bounded (a duration can't be negative) or extend without limit in either direction.
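To make the discrete/continuous split a bit more concrete, here is a minimal sketch using only the Python standard library. The sample values, the `bin_values` helper, and the 15-minute bin width are all made up for illustration. The idea it shows: discrete values repeat exactly, so you can count them per value, while continuous values rarely repeat and usually need to be grouped into intervals first.

```python
from collections import Counter

# Hypothetical sample data, invented for this example.
children_per_family = [0, 2, 1, 3, 2, 2, 1, 0]           # discrete: whole counts
commute_minutes = [12.5, 47.25, 30.0, 18.75, 33.1, 5.9]  # continuous: any value in a range

# Discrete values repeat exactly, so counting occurrences per value is meaningful.
child_counts = Counter(children_per_family)


def bin_values(values, width):
    """Group continuous values into half-open bins [start, start + width)."""
    bins = Counter()
    for value in values:
        start = (value // width) * width  # lower edge of the bin this value falls in
        bins[start] += 1
    return dict(sorted(bins.items()))


# Continuous values almost never repeat exactly, so we count per interval instead.
commute_bins = bin_values(commute_minutes, width=15)

print(child_counts)
print(commute_bins)
```

Counting `commute_minutes` per exact value would give six groups of one, which tells you nothing. That is usually the first practical sign that a column should be treated as continuous.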
Qualitative Data
Qualitative data reflects qualities or characteristics. Here, numbers serve as labels or markers rather than representing a mathematical value.
For example, consider a group of people from different states. Californians might be represented by 1, and New Yorkers by 2. These numbers don't convey a quantity; they simply label categories that we're analyzing.
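Here is a small sketch of why treating such labels as real numbers goes wrong. The codes 1 and 2 come from the example above; the `people` sample and the variable names are assumptions of mine. Arithmetic on the codes runs without error but produces a value that corresponds to nothing, while counting and taking the most frequent category still make sense.

```python
from collections import Counter
from statistics import mean, mode

# Codes from the example above: the numbers are labels, not quantities.
STATE_LABELS = {1: "California", 2: "New York"}

# Hypothetical sample of five people, stored as state codes.
people = [1, 2, 2, 1, 2]

# Python will happily average the codes, but there is no state "1.6".
meaningless_average = mean(people)

# What does make sense for labels: counting and the most frequent category.
state_counts = Counter(STATE_LABELS[code] for code in people)
most_common_state = STATE_LABELS[mode(people)]

print(meaningless_average)
print(state_counts)
print(most_common_state)
```

Nothing in the language stops you from averaging the codes, which is exactly why this kind of mistake slips into pipelines unnoticed.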
Qualitative variables can be either nominal or ordinal.
- Nominal data categorizes without implying any order. Eye color (blue, green, brown) is nominal; there's no hierarchy or ranking among the values.
- Ordinal data, however, does introduce a rank or order. School grades (A, B, C, D, F) are ordinal. An A is better than a C, and while we can't say exactly how much better, the order is clear.
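The nominal/ordinal difference shows up directly in which operations are safe. Below is a minimal sketch, with invented sample data and an explicit `GRADE_ORDER` list that I'm assuming for illustration. Ordinal values can be sorted and have a meaningful median or maximum once the order is spelled out; nominal values only support counting and equality checks.

```python
from collections import Counter

# Ordinal: the order must be stated explicitly, here from worst to best.
# Plain alphabetical sorting would happen to rank "A" first, but only by accident.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Hypothetical sample data.
grades = ["B", "A", "C", "F", "B", "A", "D"]
eye_colors = ["brown", "blue", "green", "brown", "brown"]

# Ordinal data supports sorting, and therefore median, min, and max.
sorted_grades = sorted(grades, key=RANK.__getitem__)
median_grade = sorted_grades[len(sorted_grades) // 2]  # sample has odd length
best_grade = max(grades, key=RANK.__getitem__)

# Nominal data supports only counting and equality: there is no "largest" eye color.
eye_counts = Counter(eye_colors)
most_common_eye_color = eye_counts.most_common(1)[0][0]

print(sorted_grades, median_grade, best_grade)
print(eye_counts, most_common_eye_color)
```

Note that the sketch never averages the grade ranks. The gaps between ranks are not known to be equal, so a mean would claim more than the data supports.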