Many people are confused about Data Science and Machine Learning. Let's find out what each of them means.
With the massive spread of technology, humans have generated huge amounts of data, far more than they can process or visualize by hand: data on our calls and movements, Internet behavior, shopping preferences, climate processes, and many other things. Businesses can benefit greatly from this data if it is processed correctly.
This is what Data Science is all about. With the help of statistical methods, engineering, and an understanding of a particular domain, it's possible to find patterns in the data and extract knowledge from it. But although it is called science, this activity is of purely practical interest. It is necessary to understand and explain the phenomena, but what matters most is that the knowledge derived from the data can be applied for business purposes. And yes, you don't need a Ph.D. in math to understand that a shelf of chips needs to be placed closer to a shelf of beer to increase profits, but you need to prove that to the business.
Why is data science attracting so much interest?
The main reason is the hidden efficiency contained in the data. Every company collects data, and analyzing it gives companies the opportunity to make better products, attract and retain more target customers, improve business processes, and much more. Tons of already accumulated data that lie idle can reveal hidden processes, customer preferences, and opportunities for optimization and personalization. The approaches of Data Science allow us to draw objective conclusions from the available data, (mostly) free from bias or prejudice, and to make new discoveries from the patterns found. But you should understand that data is not always useful, no matter how much of it you have.
Both the definition of the term and the practice of data science can vary from company to company. For one company, the Data Science team is one BI analyst who plots charts in Excel and writes some business reports; for another, it is an end-to-end development department that covers everything from communication with the client and analysis of its data to infrastructure and the construction of a production-ready system. But overall, a data scientist understands a particular domain, has a mathematical or statistical background, and knows how to work with data and where and how to look in it, and usually does not have development or engineering experience. With different tasks, they are helped by a data engineer or machine learning engineer (it all depends on the skillset).
If it is written in Python, then it is ML. If it is written in PowerPoint, then it is AI.
In the past, computers always received new capabilities through programming: a human created algorithms for the machine, which led to the expected result. This is a deterministic and understandable approach to any understandable and deterministic task that requires some kind of automation.
But sometimes a task, though understandable, has a non-deterministic element in it: an unknown that cannot be eliminated and that we must live with. To work effectively with such problems, a different approach is needed. And as people say, necessity is the mother of invention (but nobody knows who the father is). Machine learning has become that new approach.
In ML, a person only gives the computer some introductory information, but the results of an algorithm are not determined by the person. A person defines a way for the machine to learn, but the machine learns by itself from the data that has been given to it; the machine itself comes up with answers. This is similar to the way you and I are learning.
What is learning for the machine?
A machine is said to learn from past experience (data) with respect to a certain class of tasks if its performance on those tasks, measured by a given metric, improves with that experience. This process can also be called adaptation (a term that may fit the process better): the machine adapts its behavior to new information. And this adaptation, which seems to involve no human intervention, from time to time gives the impression that the machine is actually learning.
Machine learning is essentially a data analysis method that automates the construction of analytical models using algorithms that iterate through data. Machine learning allows computers to find hidden knowledge without being explicitly programmed where to look. And this is the key idea.
In fact, we feed data to the algorithm, and the result of running the program is the logic for handling new data.
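To make this idea concrete, here is a minimal sketch in plain Python. The data, the function name `fit_threshold`, and the choice of a single-threshold rule are all assumptions made for illustration, not a method taken from any particular library. The point is only that the cut-off is found from the examples rather than typed in by a person, and that what training returns is a piece of logic we can apply to new data.

```python
def fit_threshold(values, labels):
    """Pick the cut-off that misclassifies the fewest training examples."""
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        # Count examples where the rule "value >= threshold" disagrees with the label.
        errors = sum((v >= threshold) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    # The returned function is the "logic" produced by training.
    return lambda value: value >= best_threshold


# Hypothetical training data: a numeric feature and a yes/no label.
feature = [12, 15, 20, 48, 55, 60]
label = [False, False, False, True, True, True]

predict = fit_threshold(feature, label)

# The learned logic can now be applied to values it has never seen.
print(predict(10), predict(70))
```

Nobody wrote the rule "48 or more means yes" by hand; it came out of the data. With different data, the same code would produce different logic.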
There are three parts that are necessary for a successful ML project:
First, machine learning always starts with data, and your goal is to extract knowledge or understanding from this data. You have a question that you are trying to answer, and you assume that the data can be used to answer that question.
Second, machine learning involves a degree of automation. Instead of trying to collect your knowledge from the data manually, you apply a process or algorithm to the data using a computer so that a computer can help you obtain the necessary knowledge.
Third, machine learning is not a fully automated process. As any practitioner will tell you, machine learning requires you to make a lot of smart decisions in order to make the whole process successful.
Data Science and Machine Learning process
Let's split the whole DS/ML project into steps and take a closer look at where the Data Science part finishes and the Machine Learning part begins.
The Data Science process can be slightly different depending on the goals of the project and the approaches used, but it usually looks like the following.
Finding and determining the goal
It is very important to understand the business problem first. The data scientist should ask the appropriate questions and understand and define the goals of the problem to be solved. This is not always easy, as the business itself often wants a lot but nothing concrete.
Collecting and storing data
Then the data scientist takes care of collecting and scraping data from several sources such as SAP servers, API databases, and online storage. Sometimes all data is already collected in a handy data warehouse, but sometimes you need to make an effort to get the data. The data engineering team helps here most of the time to build reliable data pipelines.
Data processing and cleansing
Regardless of the machine learning algorithm, one cannot learn anything from data that contains too much noise or is too inconsistent with reality: garbage in, garbage out. To make the whole project successful, we need to clean the acquired data.
After data acquisition, the data is processed. This stage includes data cleansing and data conversion. Data cleansing is the most time-consuming process, as it involves handling many complex scenarios. For example:
- conflicting data types
- misspelled attributes
- missing values
- duplicated values
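The four scenarios above can be sketched in a few lines of plain Python. The rows, the column names, and the `KEY_FIXES` mapping are hypothetical, and real projects usually do this with a dataframe library, but the logic is the same. Note that what to do with a missing value (keep it, fill it, or drop the row) is a project decision; this sketch simply keeps it explicit.

```python
# Hypothetical raw rows showing each of the problems listed above.
raw_rows = [
    {"customer_id": "1", "age": "34", "city": "Berlin"},
    {"customer_id": 2, "age": None, "city": "Paris"},     # conflicting type, missing value
    {"customer_id": "1", "age": "34", "city": "Berlin"},  # duplicate of the first row
    {"custmer_id": "3", "age": "41", "city": "Rome"},     # misspelled attribute
]

# Known typos mapped back to canonical attribute names (an assumption for this example).
KEY_FIXES = {"custmer_id": "customer_id"}


def clean(rows, default_age=None):
    seen, cleaned = set(), []
    for row in rows:
        # Misspelled attributes: rename known typos.
        row = {KEY_FIXES.get(key, key): value for key, value in row.items()}
        # Conflicting data types: store ids and ages as integers.
        row["customer_id"] = int(row["customer_id"])
        # Missing values: replace with an explicit default.
        row["age"] = int(row["age"]) if row["age"] is not None else default_age
        # Duplicated values: skip rows that are identical after normalization.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

The order of the steps matters: duplicates are detected only after names and types are normalized, otherwise `"1"` and `1` would look like two different customers.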
Understanding what can actually be done with the data is very important for both the business and the data scientist, so the next step is an exploratory analysis of the data. With Exploratory Data Analysis, the data scientist identifies and refines the choice of variables to be used in the next steps.
The work now moves on to the core data science activities, such as data modeling. The data scientist selects one or more potential models and algorithms and chooses a metric for the model's performance, then applies statistical and machine learning methods to the data to find the model that best meets the business requirements and the task at hand (this can be a simple heuristic). Models are trained on the available data and tested to select the most effective one. This is an iterative process and, despite everything, a very creative one.
This step is often over-emphasized. It's very rare that the focus of a data scientist will be on making a model 1% better. Typically it's much more important to get a "good enough" model out the door and in front of users. A "good enough" model in production is 100x better than a model in a Jupyter notebook that performs 5-10% better.
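As a rough illustration of this loop, the sketch below trains two candidates on the same data, scores both on held-out examples with one chosen metric, and keeps the better one. Everything here is an assumption made for the example: the numbers are invented, the candidates are a mean-value heuristic and a one-variable least-squares line, and the metric is mean absolute error. A real project would also keep a separate validation set so that the final test data is not used for selection.

```python
from statistics import mean


def fit_mean_baseline(xs, ys):
    """Simple heuristic: always predict the average of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_linear(xs, ys):
    """One-variable least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, xs, ys):
    """The chosen performance metric: lower is better."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


# Hypothetical data, split into a training part and a held-out part.
train_x, train_y = [1, 2, 3, 4, 5, 6], [12, 19, 31, 42, 48, 61]
test_x, test_y = [7, 8], [70, 79]

candidates = {"mean baseline": fit_mean_baseline, "linear regression": fit_linear}
scores = {
    name: mean_absolute_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(scores, "->", best_name)
```

Keeping the heuristic in the comparison is deliberate: if a more complex model cannot beat it on the chosen metric, the heuristic is the one that should ship.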
Presentation of final results
The trickiest part is not over yet: visualization and communication. You need to meet with customers and stakeholders again to deliver the business results simply and efficiently.
At this stage the project may end: perhaps the business has reached its goal, or the POC has not shown a return on investment for the business, and further work is not required.
Then, finally, the most important stage begins: you need to bring the result of the Data Science team in front of the user, that is, to deploy and optimize the model and integrate it into other business processes. This stage can vary greatly depending on the type of data (clickstream, batch processing), the target platform (AWS, Azure, on-premise, etc.), the requirements (SLA, horizontal scalability, etc.), the final technology stack, and so on.
Essentially, it's a normal development cycle with ML add-ons: you need to optimize the model, check all edge cases, set up a life cycle for model artifacts, test the selected model in a pre-production environment before deploying it to production, which is best practice (you may have to choose a less performant model), and much more. At the end of the day, machine learning is software.
Immediately after the system has been successfully deployed, it is necessary to introduce monitoring. And no, I do not mean just enabling logging: I mean reports and dashboards for analytics, the calculation and generation of selected metrics and possibly associated data, an A/B testing system, and so on.
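As a small sketch of what calculating a selected metric can look like, the class below tracks accuracy over the most recent predictions and flags when it falls below a threshold. The class name `AccuracyMonitor`, the window size, and the threshold are assumptions for illustration; a production setup would typically push such numbers to a dashboard or alerting system rather than print them.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over a sliding window of recent predictions."""

    def __init__(self, window_size=100, alert_below=0.8):
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        current = self.accuracy()
        return current is not None and current < self.alert_below


# Hypothetical stream of (prediction, ground truth) pairs from production.
monitor = AccuracyMonitor(window_size=4, alert_below=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_attention())
```

The sliding window is the important design choice: a model that was accurate last month but is drifting now should trigger attention, which a lifetime average would hide.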
Of course, I have touched upon a small part of what happens in real projects, but I think it should give you an understanding of the basic activities that make up a regular DS/ML project.
Like all other fields, data work is moving towards full-stack specialists: people who concentrate on one main field but know how to do all the other parts (a backend developer who can implement a frontend, a UI designer who can think about UX, etc.). So data scientists are moving in the direction of engineering, strengthening their knowledge of infrastructure, proper code design, and tooling. And data engineers are moving towards data science, trying to understand statistics, algorithms, and approaches to working with data. As a result, they meet somewhere in the middle, in the role of Machine Learning Engineer.
As I said before, software engineering, development, and deployment skills are more important in production-level ML projects. A "good enough" model in production is better than a more performant model in a Jupyter notebook. That's why I think that data scientists are needed mainly as consultants who can work on several projects, while the direct development can be done by machine learning and data engineers. The future of data science is data engineering.