My friend asked me an interesting question about which skills are worth learning for Data Management specialists and how to build a growth roadmap.
In fact, the question made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm speculating about the current state and the future of Data Management.
To begin with, as in any other area, there are basic things that any Software Engineer should know.
In short, I assume that a person who comes to Big Data already knows some programming language and has an idea about such basic things as CS algorithms, SQL, VCS, SDLC, networking fundamentals, Linux basics, CI/CD, etc. One way or another, the common practices of software engineering are not going anywhere: this is the foundation that I think is needed in any field. If you lack them, you had better learn them, though perhaps you can argue with me here.
Big Data Basics
This is another layer of knowledge that I consider basic. Here I would highlight programming languages, because in the Big Data world there are predominant languages without which it's hard (or impossible) to get anything done. These are, of course, JVM languages such as Java and Scala, which may soon be joined by Kotlin, and Python as the second-best language for everything and the de facto language for Data Science.
Aside from languages, there are some terms and fundamental concepts behind Big Data that I think should be learned, such as horizontal scaling, distributed systems, asynchronous communication, CAP, eventual consistency, consensus problems, availability, durability, reliability, resiliency, fault tolerance, disaster recovery, etc. You don't need a deep understanding of all of them, but if you want to succeed in the field you should at least Google them.
Now let's talk about how the field is developing, to be more precise, the challenges it faces.
Analytics and BI is an approach to making data-driven decisions that provides information to help businesses. Using this level of technology, you can run queries to answer the questions a business asks, slice and dice data, build dashboards, and create clear visualizations.
With more data, you can make more accurate forecasts, build more reliable solutions, and create new ones, climbing higher and higher on the ML ladder.
Machine learning is a specific set of analytics methods that allow you to create models that can analyze large and complex data and make predictions or decisions without being explicitly programmed to do so. More and more organizations use ML methods to complement their daily operational analytics and normal business operations.
In the past, ML was limited to some extent by the fact that data scientists could not evaluate and test a solution before a team of data engineers deployed it into production. In fact, most organizations had a traditional BI/analytics team, followed by a separate data science team and a data engineering team. These skill sets have now begun to overlap, and those teams work together more thoughtfully as Big Data slowly crawls towards analytics and building knowledge on top of Big Data. After all, Big Data is too big to handle without the help of ML. Therefore, I think understanding at least the basic concepts of ML is required. Of course, special attention should be paid to the things on which it relies: statistics, ML approaches, optimization methods, bias/variance, the different metrics (understanding these is actually important), etc. In applied Machine Learning you basically need to understand why everything works. The formulas are not that important, but those who do not understand the language behind the models usually make very stupid mistakes.
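To illustrate why the choice of metric matters, here is a minimal sketch in plain Python. The labels are made up for the example (95 negatives, 5 positives), and the "model" is a hypothetical one that always predicts the majority class. It looks fine by accuracy and is useless by recall.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # By convention here, precision is 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative, made-up data: a heavily imbalanced binary problem.
y_true = [0] * 95 + [1] * 5
# Hypothetical "model" that always predicts the majority class.
y_pred = [0] * 100

print("accuracy: ", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("recall:   ", recall(y_true, y_pred))
```

The high accuracy comes entirely from the class imbalance, which is exactly the kind of mistake that is easy to make without understanding what each metric measures.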
There's a lot more to say, but I'll leave it for another time. There are many areas inside ML (NLP, CV, Recommender Systems, knowledge representation, etc.), but usually, when you understand at least the beginning, you already understand what you do not understand, so of course, you can go as deep as you want.
If you want to be a machine learning engineer, make sure you know Python. It's the lingua franca of Machine Learning. Then it's worth learning the different frameworks for working with data, like NumPy, Pandas, Dask, and the already mentioned Apache Spark. And of course, the most popular ML libraries: scikit-learn and XGBoost.
I think everybody understands that the really important directions in ML have, in one way or another, long been related to Deep Learning. Classic algorithms are not going anywhere, of course, and in most cases they are enough to build a good model. But the future, of course, lies in neural nets. The magic of deep learning is that it gets better with more data. It is also worth mentioning terms like transfer learning, the 1cycle policy, CUDA, and GPU optimizations here.
The most popular libraries here are PyTorch with fast.ai, and TensorFlow with Keras (I think we will bury it in the near future, but anyway).
Another thing worth mentioning is Distributed ML. As I said, Big Data is slowly heading towards more sophisticated analytics on big data. Large datasets stored in a central repository place huge demands on processing and computation, so distributed ML is the right direction, although it has a lot of problems (to be discussed next time).
I am personally interested in this approach, but it really doesn't matter to anyone but huge corporations. Model accuracy is incredibly important to them, and it can only be achieved by building huge models with millions of parameters and a huge pile of data. For everyone else, as I said, classical algorithms on subsets or pre-aggregated data are quite good for practical applications.
I will mention a couple of tools that work: Spark MLlib, H2O, Horovod, MXNet, DL4J. Perhaps Apache Ignite.
While organizations generally value real-time data management, not all companies go for analysis of large data in real-time. Reasons may vary — lack of experience or insufficient funds, fear of related problems, or general reluctance of management. However, those companies that implement real-time analytics will gain a competitive advantage.
The tools here are Apache Spark Streaming, Apache Ignite Streaming, and AWS Kinesis.
Automation in Data Science
AutoML was invented to automate, at least partly, data preprocessing, feature engineering, model selection and configuration, and the evaluation of results. It can take over those tasks and give you an idea of where to continue your research.
It sounds great, of course, but how effective is it? The answer depends on how you use it. It's about understanding what people are good at and what machines are good at. People are good at connecting existing data to the real world: they understand the business area and what specific data means. Machines are good at calculating statistics, storing and updating state, and doing repetitive processes. Tasks like exploratory data analysis, data preprocessing, hyperparameter tuning, model selection, and putting models into production can be automated to some extent with automated machine learning frameworks, but good feature engineering and drawing actionable insights still require a human data scientist who understands the business context. By separating these activities, we can easily benefit from AutoML now, and I think that in the future AutoML will replace most of the work of a data scientist.
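As a rough illustration of the part that machines handle well, here is a minimal sketch of an automated hyperparameter search in plain Python. It is not how any particular AutoML framework works. The model (a one-parameter ridge regression through the origin), the candidate values, and the tiny dataset are all assumptions chosen to keep the example self-contained.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form fit of y ~ w * x with L2 penalty lam on w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(candidates, train, valid):
    """Try every candidate penalty and keep the one with the lowest validation error."""
    best_lam, best_w, best_err = None, None, float("inf")
    for lam in candidates:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, valid[0], valid[1])
        if err < best_err:
            best_lam, best_w, best_err = lam, w, err
    return best_lam, best_w, best_err


# Made-up data, roughly following y = 2x.
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
valid = ([5, 6], [10.1, 11.9])

best_lam, best_w, best_err = grid_search([0.0, 0.1, 1.0, 10.0], train, valid)
print(f"best lambda={best_lam}, w={best_w:.3f}, validation MSE={best_err:.4f}")
```

The loop is purely mechanical, which is why it automates well. Deciding that `x` is the right feature in the first place is the part that still needs someone who understands the data.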
What's more, I think it's noticeable that ML platform solutions (e.g. Amazon SageMaker, Microsoft Azure ML, Google Cloud ML, etc.) are evolving rapidly, and as ML adoption grows, many enterprises are quickly moving to ready-to-use ML platforms to accelerate time to market, reduce operating costs, and improve success rates (the number of ML models deployed and commissioned).
Our culture is visual, and visualization is increasingly a key tool for understanding the Big Data generated every day. Data visualization helps tell stories by presenting the data in an easy-to-understand form, highlighting trends and deviations. Good visualization tells the story, removing noise from the data and highlighting useful information and promising trends for the business.
The most popular tools here, it seems to me, are Tableau and Looker, in addition to the rest of the technology stack described above, which is necessary one way or another for building and managing data warehouses, etc.
Tableau is a powerful BI and data visualization tool that connects to data and lets you perform detailed, comprehensive analysis, as well as draw charts and dashboards.
Looker is a cloud-based BI platform that lets you query and analyze large amounts of data using SQL-defined metrics and then configure visualizations that tell a story about your data.
To solve all the problems described above you need an infrastructure with the right architecture and the right management, monitoring, and provisioning environment for it. And that's not all I include in this section: there is also the orchestration of pipelines and the introduction of DevOps practices in various areas of data management.
Building microservices has long been a settled issue. One way or another, all serious solutions are built on a microservice architecture. This is where Docker containers, k8s, Helm, Terraform, Vault, Consul, and everything around them come in. All of this became a standard without anyone noticing.
Log management is the processing of log events generated by different software applications and the infrastructure on which they run. It can include the collection, analysis, storage, and search of logs, with the ultimate goal of using the data for troubleshooting and for obtaining business, application, and infrastructure insights.
One of the important tools here is ELK, which consists of the following components: Elasticsearch (a text search tool), Logstash and Beats (data shipping tools), and Kibana (a data visualization tool). Together they provide a fully working tool for real-time data analysis. While they're all designed to work together, each of them is a separate project. ELK provides functions for online analysis such as report creation, alerts, log search, etc. This makes it a universal tool not only for DevOps but also for the areas mentioned above.
Another tool, Splunk, is a machine data platform that gives users, administrators, and developers the ability to instantly receive and analyze all the data created by applications and network devices in the IT infrastructure, as well as any other machine data. Splunk can ingest machine data and turn it into real-time analytics through charts, alerts, reports, and more.
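To make the collection, analysis, and search steps of log management a bit more concrete, here is a toy sketch in plain Python. It is not how ELK or Splunk work internally. The log line format (`timestamp LEVEL service message`) and the sample lines are assumptions made up for the example.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <service> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match the format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def parse_logs(lines):
    """Collection step: keep only the lines that could be parsed."""
    return [event for event in map(parse_line, lines) if event is not None]


def count_by_level(events):
    """Analysis step: how many events per severity level."""
    return Counter(event["level"] for event in events)


def search(events, level=None, text=None):
    """Search step: filter by level and/or case-insensitive substring of the message."""
    results = []
    for event in events:
        if level is not None and event["level"] != level:
            continue
        if text is not None and text.lower() not in event["message"].lower():
            continue
        results.append(event)
    return results


# Made-up sample lines, including one that does not match the format.
raw_lines = [
    "2024-01-01T10:00:00Z INFO api Request served",
    "2024-01-01T10:00:01Z ERROR api Upstream timeout",
    "2024-01-01T10:00:02Z WARN worker Queue length high",
    "2024-01-01T10:00:03Z ERROR worker Job failed",
    "garbage line",
]

events = parse_logs(raw_lines)
print(count_by_level(events))
print(search(events, level="ERROR", text="timeout"))
```

Real log platforms add storage, indexing, and alerting on top of this, but the basic shape (parse into structured events, then aggregate and filter) is the same idea.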
Pipeline Orchestration — automated management of fault-tolerant workflows.
Most big data solutions consist of repetitive data processing operations encapsulated in workflows. Pipeline orchestration tools help automate these workflows. They can schedule jobs, execute workflows, and coordinate dependencies between tasks in a fault-tolerant way.
I used to hear about Oozie a lot, but now it's mostly Airflow or AWS Step Functions.
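The core idea of coordinating dependencies in a fault-tolerant way can be sketched in a few lines of Python. This is only an illustration of the concept, not how Airflow or Step Functions are implemented. The task names, the `run_pipeline` helper, and the simple retry policy are all hypothetical. It relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task up to max_retries times.

    tasks:        mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of task names it depends on
    """
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    for name in order:
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                # Give up and propagate the error once the retry budget is spent.
                if attempt == max_retries + 1:
                    raise
    return completed


# Hypothetical three-step workflow with a transform step that fails once.
attempts = {"transform": 0}
run_log = []


def extract():
    run_log.append("extract")


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transient failure")
    run_log.append("transform")


def load():
    run_log.append("load")


tasks = {"extract": extract, "transform": transform, "load": load}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

completed = run_pipeline(tasks, dependencies)
print(completed)
```

Real orchestrators add scheduling, persistence of task state, backoff between retries, and parallel execution of independent tasks, but the dependency graph plus retry loop is the heart of it.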
In Big Data it is obvious that the future lies in the cloud, and anyone interested in data management had better understand its concepts.
In addition to programming patterns (API Gateway, Pub/Sub, Sidecars, etc.) applied at the cloud level, you will encounter different concepts such as Infrastructure as Code, Serverless, and of course architectural concepts (N-Tier, Microservices, Loose coupling, etc.). Personally, it gave me a deeper understanding of engineering principles at a higher level and improved (a little) my architectural skills.
The main clouds are GCP, AWS, and Azure. I think few would argue that there are other serious options. Say you decide to go with AWS. All the clouds are designed in a similar way, although each has its specifics, and not all CSP services map onto one another.
Where do I start to learn AWS?
So, you go to the AWS Documentation and you see an endless list of services, but it's just the global table of contents of global tables of contents! That's right: Amazon is huge right now. At the time of writing, there are about two hundred and fifty services under the hood. It is not realistic to learn them all, and there is no reason to do so.
John Markoff says “The Internet is entering its Lego era.” AWS services are similar to Lego: you find the right pieces and combine them.
To highlight the most essential pieces, it is reasonable to start with the ones that historically came first. They are:
- S3 — storage
- EC2 — virtual machines + EBS drives
- RDS — databases
- Route53 — DNS
- VPC — network
- ELB — load balancers
- CloudFront — CDN
- SQS/SNS — messages
- IAM — main access rights to everything
- CloudWatch — logs/metrics
Then there are modern serverless pieces (Lambda, DynamoDB, API Gateway, CloudFront, IAM, SNS, SQS, Step Functions, EventBridge).
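To give a feel for how the serverless pieces snap together, here is a minimal sketch of a Python Lambda function consuming messages from an SQS queue. The `process_order` function and the message payload are hypothetical. The event shape (`Records` with `messageId` and `body`) follows the SQS event format, and the `batchItemFailures` return value only has an effect if partial batch responses (`ReportBatchItemFailures`) are enabled on the event source mapping.

```python
import json


def process_order(order):
    """Hypothetical business logic: validate the payload and return the order id."""
    if "order_id" not in order:
        raise ValueError("missing order_id")
    return order["order_id"]


def handler(event, context):
    """Lambda entry point for an SQS-triggered function."""
    processed = []
    failures = []
    for record in event.get("Records", []):
        try:
            payload = json.loads(record["body"])
            processed.append(process_order(payload))
        except (ValueError, KeyError, TypeError):
            # Report only this message as failed so the rest of the batch is not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    print(f"processed={len(processed)} failed={len(failures)}")
    return {"batchItemFailures": failures}


# Local smoke test with a made-up event; in AWS the service builds this for you.
sample_event = {
    "Records": [
        {"messageId": "1", "body": '{"order_id": 42}'},
        {"messageId": "2", "body": "not json"},
    ]
}
result = handler(sample_event, None)
print(result)
```

Because the handler is a plain function, it can be tested locally like this before wiring it to a real queue.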
The process of integrating and preparing data for migration from an on-prem solution to the cloud is complex and time-consuming. In addition to migrating massive volumes of existing data, companies will have to keep their data sources and platforms synchronized for the weeks or months it takes before migration is complete.
In addition to migration, enterprises are preparing disaster recovery plans to be ready for anything without sacrificing business, and the obvious solution here is migration to the cloud as well.
Our ML algorithms are fine, but good results do require a large team of data specialists, data engineers, field experts, and support staff. And as if the cost of experts were not constraining enough, our understanding is still primitive. Finally, moving models into production and keeping them up to date is the last hurdle, given that the results produced by the model can often be reproduced only on the same expensive and complex architecture that was used for training. It should be understood that moving to production is a process, not a step, and it starts long before model development. Its first step is to define the business objective, the hypothesis about the value that can be extracted from the data, and the business ideas for its application.
MLOps is a combination of technologies, processes, and approaches for bringing developed ML models into business processes. The concept emerged as an analogy to DevOps applied to ML models and ML approaches. DevOps is an approach to software development that increases the speed at which individual changes are delivered while maintaining flexibility and reliability through a number of practices, including continuous development, division of functionality into independent microservices, automated testing and deployment of individual changes, global performance monitoring, a system of prompt response to detected failures, etc.
MLOps, or DevOps for machine learning, allows data science and IT teams to collaborate and accelerate model development and implementation by monitoring, validating, and managing machine learning models.
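As a small illustration of what "validating" and "monitoring" a model can mean in practice, here is a hedged sketch of two checks an MLOps pipeline might run. The `ModelVersion` class, the function names, and the thresholds are all hypothetical and not taken from Seldon, Kubeflow, MLflow, or any other tool.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str
    accuracy: float  # assumed to be measured on a held-out validation set


def should_promote(candidate, production, min_gain=0.01):
    """Validation gate: promote only if the candidate beats production by a margin."""
    return candidate.accuracy >= production.accuracy + min_gain


def detect_degradation(recent_accuracies, baseline, tolerance=0.05):
    """Monitoring check: flag the model if its recent average drops too far below baseline."""
    if not recent_accuracies:
        return False
    average = sum(recent_accuracies) / len(recent_accuracies)
    return average < baseline - tolerance


production = ModelVersion("v1", accuracy=0.90)
candidate = ModelVersion("v2", accuracy=0.93)

print("promote v2:", should_promote(candidate, production))
print("v1 degraded:", detect_degradation([0.80, 0.82], baseline=production.accuracy))
```

In a real setup these checks would run automatically in CI/CD and on live traffic, with the metric and thresholds agreed with the business up front.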
Of course, there is nothing new here: everyone has been doing it one way or another for a while. Now there is simply a hype word for it, behind which there are usually ready-made solutions like Seldon, Kubeflow, or MLflow.
Aside from that, there are a lot of tools for getting an ML model to production. The most popular ones, aside from those already mentioned, are I think TensorFlow Serving and Vowpal Wabbit. It is also worth looking at existing ML platforms that manage the ML application lifecycle from data to experimentation to production: Google TFX (TensorFlow Extended), Facebook FBLearner, Uber Michelangelo, AWS SageMaker, Stanford DAWN, Airbnb's Bighead, etc.
It turned out to be a lot, and yet it seems that I did not say anything. By no means do I consider myself an expert in everything, and I am only speculating about technology here. But as you can see even from this article, many skills overlap across several areas of Big Data and do not end there. With them, you need not be afraid that you will not find a job.
Do not chase trends — build skills that stay relevant. And the most relevant skills are probably soft skills.
The steep learning curve associated with a lot of data engineering activities is becoming a cliff. Developers hand-coding projects need deep knowledge of many aspects of the organization, as well as of a lot of tools and existing solutions.
In data management, we are still in the Wild West, especially in ML...