My friend asked me an interesting question: what skills are worth learning for Data Management specialists, and how do you build a growth roadmap?
The question made me think, because I didn't have a clear picture in my head. What follows are just my thoughts on the topic; for the most part, I'm speculating about the current state and the future of Data Management.
In the beginning, as in any other area, there are basic things that any Software Engineer should know.
In short, I assume that a person who comes to Big Data already knows some programming language and has an idea of basic things such as CS algorithms, SQL, VCS, SDLC, networking fundamentals, Linux basics, CI/CD, etc. One way or another, the common practices of software engineering don't go anywhere; this is the foundation that, I think, is needed in any field. If you lack them, you'd better learn them first, though perhaps you can argue with me here.
Big Data Basics
This is another layer of knowledge that I consider basic. Here I would highlight programming languages, because in the Big Data world there are predominant languages without which it's just hard (or impossible) to do anything: the JVM languages Java and Scala (which may soon be joined by Kotlin), and, of course, Python, the second-best language for everything and the de-facto language of Data Science.
Aside from languages, there are terms and fundamental concepts behind Big Data that I think should be learned, such as horizontal scaling, distributed systems, asynchronous communication, the CAP theorem, eventual consistency, consensus problems, availability, durability, reliability, resiliency, fault tolerance, disaster recovery, etc. There is no need for a deep understanding of all of them, but if you want to succeed in the field, you'd better at least Google them.
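To make one of these concepts a bit more tangible, here is a deliberately simplified toy (not a real replication protocol) illustrating the quorum trade-off behind eventual consistency: with N replicas, choosing read and write quorums so that R + W > N guarantees a read overlaps the latest write, while smaller quorums can return stale data. The `QuorumStore` class is invented for this sketch.

```python
class QuorumStore:
    """Toy N-replica store; each replica maps key -> (version, value)."""

    def __init__(self, n=3):
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value, version, w=2):
        # Write to only W replicas; the rest stay stale until some
        # anti-entropy process (not modeled here) catches them up.
        for replica in self.replicas[:w]:
            replica[key] = (version, value)

    def read(self, key, r=2):
        # Read from R replicas and keep the highest version seen.
        answers = [rep.get(key, (0, None)) for rep in self.replicas[-r:]]
        return max(answers)[1]

store = QuorumStore(n=3)
store.write("user:1", "alice", version=1, w=2)
print(store.read("user:1", r=2))  # R + W > N: overlaps the write -> 'alice'
print(store.read("user:1", r=1))  # R + W <= N: may hit only stale replicas -> None
```

The second read shows exactly the "eventual" part: until replicas converge, an insufficient quorum can miss the latest write.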
Now let's talk about how the field is developing or, to be more precise, the challenges it faces.
This layer of knowledge, in one form or another, refers to the problems that have to be solved when working with data, and as you can imagine: where you have big data, you have big problems.
When you're working on a particular layer of data processing, you'll need a particular set of skills. Let's dive into them.
The majority of Big Data pipelines consist of several common layers, which we'll walk through below.
With the increasing amount of stored information, the problem of Data Storage has historically been the first. This is the foundation of any system that works with data — there are many technologies that store massive amounts of raw data coming from traditional sources such as OLTP databases, and newer, less structured sources such as log files, sensors, web analytics, document archives, and media archives. As you can see, these are very different areas with their own specifics and we are collecting data from all of them.
The first question that immediately pops up is which format to use for storing the data, and how to structure and store it optimally. Here, of course, you can think of Parquet, CSV, and Avro, formats that are very common in the Big Data world, and of codecs such as Bzip2, Snappy, LZO, etc. Optimization mostly comes down to proper partitioning or some storage-specific tuning.
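To illustrate what "proper partitioning" means in practice, here is a minimal plain-Python sketch of Hive-style partition paths: records are bucketed into directories like `events/year=2021/month=03/`, so an engine can prune whole directories when a query filters on the partition columns. The `events` dataset and helper function are made up for the example.

```python
from collections import defaultdict

def partition_path(base, record, partition_cols):
    # Hive-style layout: one "col=value" path segment per partition column.
    parts = "/".join(f"{col}={record[col]}" for col in partition_cols)
    return f"{base}/{parts}"

events = [
    {"year": 2021, "month": "03", "user": "a"},
    {"year": 2021, "month": "04", "user": "b"},
    {"year": 2021, "month": "03", "user": "c"},
]

buckets = defaultdict(list)
for e in events:
    buckets[partition_path("events", e, ["year", "month"])].append(e)

print(sorted(buckets))
# ['events/year=2021/month=03', 'events/year=2021/month=04']
```

A query like "month = 03" now only needs to touch one of the two directories instead of scanning everything.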
The main technology this layer is built on is, of course, Hadoop with HDFS, the classic large-scale file system. It became popular due to its durability and nearly limitless scale on commodity hardware. Nowadays, however, more and more data is stored in the cloud, or at least in hybrid solutions: organizations are moving from outdated on-premise storage systems to managed services such as AWS S3, GCP GCS, or Azure Blob Storage.
For SQL solutions, popular engines include Hive and Presto, plus the more interesting Data Warehouse solutions, which I think sit above simple SQL engines. We will talk about them a little bit later.
For NoSQL, it is either Cassandra for scalable wide-column storage (with tunable consistency and lightweight transactions rather than full ACID), MongoDB for the document data model and manageable data sizes, or AWS DynamoDB for a scalable managed solution if you are in the AWS cloud.
For graph databases, I can recall only Neo4j. It is very well suited for storing graph-shaped information, such as a group of people and their relationships. Modeling this kind of data in a traditional SQL database is painful and very inefficient.
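To see why graph data fits a graph model, here is a toy adjacency-list sketch in plain Python: a friend-of-friend query is a simple breadth-first traversal, while in a relational model every extra hop is another self-join. The `friends` graph and helper are invented for the example.

```python
from collections import deque

friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

def within_hops(graph, start, max_hops):
    # Breadth-first search: collect everyone reachable in <= max_hops edges.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {start}

print(sorted(within_hops(friends, "ann", 2)))  # ['bob', 'cat', 'dan']
```

A graph database like Neo4j executes this kind of traversal natively; expressing the same "friends of friends" in SQL already requires joining the relationship table against itself.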
A Data Lake is the company's centralized repository that allows storing all structured and unstructured data about the business. Here we store data as is, without structuring it, and run different types of analytics on top.
Nowadays, digital transformation is really about applying a data-driven approach to every aspect of the business in an effort to create a competitive advantage. That's why more and more companies want to build their own data lake solutions. The trend continues, and these skills are still in demand.
The most popular tools here are still HDFS for on-prem solutions and the cloud storage offerings from AWS, GCP, and Azure. Aside from that, there are data platforms trying to fill several niches at once with integrated solutions, for example Cloudera, Apache Hudi, and Delta Lake.
A Data Warehouse can be described as a relational database that stores processed business data and is optimized for aggregation queries. Either way, it is as much a foundation for analytics and data-driven decisions as a Data Lake; the two do not exclude each other but rather complement each other.
Data Marts are one of the last layers of Data Warehouse solutions designed to meet the requirements of a specific business function. Their ability to pull data from disparate sources and make it available to business users makes them a growing trend in the field of data warehousing.
Trending data warehouse solutions include Teradata, Snowflake, BigQuery, and AWS Redshift.
There are Data Warehouses, where the information is sorted, ordered, and presented in the form of final conclusions (the rest is discarded), and Data Lakes: "dump everything here, because you never know what will be useful." The Data Hub is aimed at data that belongs to neither the first nor the second category.
The Data Hub architecture allows you to leave your data where it is, centralizing the processing but not the storage. The data is searched and accessed right where it currently lives. But because the Data Hub is planned and managed, organizations must invest significant time and energy in determining what their data means, where it comes from, and what transformations it must go through before it can be put into the Data Hub.
The Data Hub is a different way of thinking about storage architecture. And I bet it will gain some attention in the future — all of the enabling pieces are available today.
To create data storage, you need to ingest data from the various sources into the data layer, whether it is a Data Lake, a Data Warehouse, or just HDFS. The data source can be a CRM like Salesforce, an enterprise resource planning system like SAP, an RDBMS like Postgres, or any log files, documents, social network graphs, etc. And data can be uploaded either through batch jobs or through real-time streams.
There are many tools for ingestion; one of the common ones is Sqoop. It provides an extensible Java-based framework for developing drivers that import data into Hadoop. Sqoop runs on Hadoop's MapReduce framework and can also export data from Hadoop back to an RDBMS.
Another common tool is Flume. It is used when a data stream arrives faster than it can be consumed. Typically, Flume is used to ingest data streams into HDFS or Kafka, where it can act as a Kafka producer. Multiple Flume agents can also collect data from multiple sources into a Flume collector.
Another tool gaining popularity is NiFi. NiFi processors are file-oriented and schema-less: data is represented as a FlowFile (which can be an actual file on disk or a block of data obtained elsewhere), and each processor is responsible for understanding the content of the data it works with. So if one processor understands format A and another only understands format B, you may have to convert the data format between the two.
And the de-facto standard of the message bus world is Kafka: an open-source streaming message bus that can create a feed from your data sources, partition the data, and stream it to consumers. Apache Kafka is a mature and powerful solution used in production at huge scale.
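To make the partitioning idea concrete, here is a toy in-memory log in plain Python. This is not Kafka's API, just the core concept it relies on: messages with the same key hash to the same partition, so per-key ordering is preserved while partitions let consumers scale out. The `Topic` class is invented for this sketch.

```python
import zlib

class Topic:
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # A stable hash (zlib.crc32) means the same key always lands in
        # the same partition, similar to Kafka's default partitioner.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

topic = Topic(num_partitions=3)
for i in range(6):
    topic.produce("user:1", f"event-{i}")  # same key -> same partition, in order

p = zlib.crc32(b"user:1") % 3
print([v for _, v in topic.partitions[p]])
# ['event-0', 'event-1', 'event-2', 'event-3', 'event-4', 'event-5']
```

Ordering holds only within a partition; two different keys may land in different partitions and be consumed in any interleaving, which is exactly the trade-off Kafka makes for horizontal scale.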
Thanks to the data ingestion pipelines, the data is fed into the data layer. Now you need technologies that can process large amounts of data to facilitate analysis and crunch this data. Data analysts and engineers want to run queries against big data, which requires huge computing power. The data processing layer must optimize the data to facilitate efficient analysis, and provide a computational engine to execute the queries.
The most popular pattern here is ETL (Extract, Transform, Load), a classic data processing paradigm. Essentially, we extract data from a source (or sources), clean it up, and convert it into structured information that we upload to a target database, data warehouse, or data lake.
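The pattern itself fits in a few lines of plain Python, which a tool like Spark then applies at cluster scale. The sample rows and target list below are made up for the example.

```python
def extract():
    # Pretend this came from an API or an OLTP dump.
    return [
        {"name": " Alice ", "amount": "10.5", "country": "de"},
        {"name": "Bob",     "amount": "bad",  "country": "US"},
        {"name": "Carol",   "amount": "7",    "country": "us"},
    ]

def transform(rows):
    clean = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        clean.append({"name": r["name"].strip(),
                      "amount": amount,
                      "country": r["country"].upper()})
    return clean

def load(rows, target):
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
# [{'name': 'Alice', 'amount': 10.5, 'country': 'DE'},
#  {'name': 'Carol', 'amount': 7.0, 'country': 'US'}]
```

The important part is the shape: transformation happens before loading, so only clean, structured records ever reach the target.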
And one of the tools that successfully implements this pattern is Apache Spark. This is the big data multi-tool that should already be in the hands of anyone dealing with large amounts of data. It runs parallelized queries over structured or unstructured data on Hadoop, Mesos, Kubernetes, and other platforms. Spark also provides a SQL interface and has good streaming and built-in ML capabilities.
Currently, there is a movement from ETL to ELT, where the transformations take place inside the data warehouse rather than upfront. It seems to me this comes from incomplete knowledge of the data: traditionally there was a lot of planning and rigor around what had to go into the warehouse to keep it stable and accessible for users, and then the input data format or the output structure changes anyway.
Tools such as Snowflake and AWS Redshift let you create an abstraction layer over the loaded data (even unstructured data), giving a simple SQL API over it so you can almost forget about the letter T.
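The ELT shape can be sketched with the standard library's sqlite3 standing in for a warehouse: load the raw data first, messy values and all, then do the "T" inside the database with SQL. The table names and sample data are invented for the example; real warehouses make this pattern convenient at a completely different scale.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")

# "E" and "L": dump the data as-is, bad values included.
db.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [("a", "10"), ("a", "5"), ("b", "oops"), ("b", "3")])

# "T" runs inside the warehouse, after loading: cast and filter in SQL.
db.execute("""
    CREATE VIEW orders AS
    SELECT customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'
""")

print(db.execute("SELECT customer, SUM(amount) FROM orders "
                 "GROUP BY customer ORDER BY customer").fetchall())
# [('a', 15.0), ('b', 3.0)]
```

Because the raw table is kept, you can redefine the view whenever your understanding of the data changes, without re-running ingestion, which is exactly the flexibility argument for ELT.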
Batch to real-time
It is now clear that real-time data collection systems are rapidly replacing batch ETL, making streaming data a reality. Both the ingestion and the processing layers are increasingly moving to real-time, which in turn pushes us to learn new concepts and to use multi-tools that can do both batch and real-time processing, such as Spark and Flink.
Because memory is becoming cheaper and enterprises rely on real-time results, in-memory computing enables richer, more interactive dashboards that deliver the latest data and are ready to report almost instantly. By analyzing data in memory rather than on the hard drive, companies get an instant view of the data and can act on it quickly.
For the most part, all known solutions already use or try to use this approach. Here again, the most understandable example is Spark and the implementation of a data grid such as Apache Ignite.
Apache Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. I actually don't know any other such format.
Another area of knowledge sits in a slightly different plane but is directly related to the data: management challenges, which cover privacy, security, governance, and data/metadata management.
Search and information retrieval
An information retrieval system is a set of algorithms that facilitates the search for relevant data or documents on user demand.
To search large amounts of data effectively, a simple scan is not advisable, and this is where various tools and solutions come in. One of the tools I see most often is ElasticSearch, used for Internet search, log analysis, and big data analytics. ElasticSearch is popular because it is easy to install, scales to hundreds of nodes without additional software, and is easy to work with thanks to its built-in REST API.
In addition, the well-known tools are Solr, Sphinx, and Lucene.
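All of these tools are built around the same core data structure, which fits in a short sketch: an inverted index maps each term to the documents containing it, so a query touches postings lists instead of scanning every document. The documents and `search` helper below are made up for the example; Lucene-based engines add ranking, analysis, and distribution on top of this idea.

```python
from collections import defaultdict

docs = {
    1: "big data storage in the cloud",
    2: "stream processing of big data",
    3: "cloud cost management",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    # AND query: intersect the postings lists of all query terms.
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search("big", "data"))  # [1, 2]
print(search("cloud"))        # [1, 3]
```

The same lookup-instead-of-scan principle is why these engines stay fast as the corpus grows: query cost depends on postings-list sizes, not on total data volume.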
This is probably one of the important areas of Big Data that is still undervalued and, in my opinion, does not yet have good solutions. The goal of data governance is to establish methods, responsibilities, and processes to standardize, integrate, secure, and store data. Without effective data governance, data inconsistencies across the organization's systems will not be resolved. This can complicate data integration and create data integrity issues that affect the accuracy of business intelligence, corporate reporting, and analytics applications.
I'm certainly not an expert in this field, but the tools I've seen here are Informatica, Teradata, Semarchy.
Constantly increasing volumes of data pose additional challenges to their protection against intrusions, leaks, and cyberattacks, as the level of data protection does not keep pace with the growth of data, vendors, and people. Comprehensive, end-to-end protection involves not only encrypting data throughout its lifecycle — at rest and in transit — but also securing it from the very beginning of the project. As you can see this affects all the aspects we are talking about in this article and, like everything about information security, it is difficult to do right.
The emergence of privacy laws such as GDPR, CCPA, and LGPD brings serious consequences for non-compliance. Businesses must take data confidentiality into account, and having specialists in these areas becomes a necessity.
Among the tools worth paying attention to are Kerberos, Apache Ranger, and Apache Accumulo.
Typically, within a company, we have a lot of data in different forms, storages, and formats, with different degrees of access to it. To find the data you need, you must either know exactly where to look or know where to start looking (if there is such a place). This is where the so-called Data Catalog comes into play.
Management of corporate data sources is an essential process that relies on information known only to limited groups within a company. But it is not easy to collect all the metadata about the data stored inside an organization and keep it up to date: people come and go, data is removed and added. Hence, building data catalogs is an important but complex task.
Among the well-known tools here are Apache Atlas and solutions from cloud providers.
Analytics and BI is an approach to making data-driven decisions and providing information that can help businesses. Using this layer of technology, you can run queries to answer the questions a business asks, slice and dice data, build dashboards, and create clear visualizations.
With more data, you can make more accurate forecasts and more reliable decisions, and build new solutions, climbing higher and higher up the ML ladder.
Machine learning is a family of analytics methods that lets you create models which analyze large and complex data and make predictions or decisions without being explicitly programmed to do so. More and more organizations use ML methods to complement their daily operational analytics and normal business operations.
In the past, ML was limited to some extent by the fact that data scientists could not evaluate and test a solution before a team of data engineers deployed it into production. Most organizations had a traditional BI/analytics team, a separate data science team, and a data engineering team. These skill sets have now begun to overlap, and those teams work together more closely as Big Data slowly crawls towards analytics and building knowledge on top of it, because Big Data is too big without the help of ML. Therefore, understanding at least the basic concepts of ML is, I think, required. Special attention should be paid to the things it relies on: statistics, optimization methods, the bias/variance trade-off, and the different evaluation metrics (this one is actually important). In applied machine learning you basically need to understand why everything works; the exact formulas matter less, but those who do not understand the language behind the models usually make very stupid mistakes.
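As a small illustration of why understanding your metrics matters, here is precision and recall computed by hand for a binary classifier. On imbalanced data, accuracy alone hides everything: below, predicting "negative" for every sample would score 80% accuracy while catching zero positives. The labels are made up for the example.

```python
def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real, how many caught
    return precision, recall

# 10 samples, only 2 positives: the classifier finds one of them and
# raises one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
```

Knowing which of these two numbers your business actually cares about (missed fraud versus annoyed customers, say) is exactly the kind of "language behind the models" the paragraph above is talking about.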
There's a lot more to say; I'll leave it for another time. There are many areas inside ML (NLP, CV, recommender systems, knowledge representation, etc.), but usually, once you understand at least the beginning, you already understand what you do not understand, so you can go as deep as you want.
If you want to be a machine learning engineer, make sure you know Python. It's the lingua franca of Machine Learning. Then it's worth learning to understand different kinds of frameworks for working with data like NumPy, Pandas, Dask, and the already mentioned Apache Spark. And of course, the most popular ML libraries: Scikit-Learn and XGBoost.
I think everybody understands that the really important directions in ML have long been related, one way or another, to Deep Learning. Classic algorithms are not going anywhere, of course; in most cases they are enough to make a good model, but the future lies in neural nets. The magic of deep learning is that it gets better with more data. It is also worth mentioning things like transfer learning, the 1cycle policy, CUDA, and GPU optimizations here.
The most popular libraries here are PyTorch with fast.ai, and TensorFlow with Keras (I think we will bury it in the near future, but anyway).
Another thing worth mentioning is Distributed ML. As I said, Big Data is slowly heading towards more sophisticated analytics. Large datasets stored in a central repository create huge processing and computational demands, so distributed ML is the right direction, although it has plenty of problems (to be discussed next time).
I am personally interested in this approach, but it really only matters to huge corporations. Model accuracy is incredibly important to them, and it can only be squeezed out of huge models with millions of parameters and a huge pile of data. For everyone else, as I said, classical algorithms on subsets or pre-aggregated data are quite good for practical applications.
I will mention a couple of tools that work: Spark MLlib, H2O, Horovod, MXNet, DL4J. Perhaps Apache Ignite.
While organizations generally value real-time data management, not all companies go for real-time analysis of large data. The reasons vary: lack of experience or funds, fear of related problems, or general reluctance of management. However, the companies that do implement real-time analytics will gain a competitive advantage.
Tools here are Apache Spark Streaming, Apache Ignite Streaming, AWS Kinesis.
Automation in Data Science
To automate data preprocessing, feature engineering, model selection and configuration, and the evaluation of results, AutoML was invented. AutoML can automate these tasks and give an understanding of where to continue the research.
It sounds great, of course, but how effective is it? The answer depends on how you use it. It's about understanding what people are good at and what machines are good at. People are good at connecting existing data to the real world: they understand the business area and what specific data means. Machines are good at calculating statistics, storing and updating state, and doing repetitive processes. Tasks like exploratory data analysis, data preprocessing, hyper-parameter tuning, model selection, and putting models into production can be automated to some extent with automated machine learning frameworks, but good feature engineering and drawing actionable insights can only be done by a human data scientist who understands the business context. By separating these activities, we can benefit from AutoML now, and I think that in the future AutoML will replace most of the work of a data scientist.
What's more, I think it's noticeable that the industry is undergoing a strong evolution of ML platform solutions (e.g. Amazon Sagemaker, Microsoft Azure ML, Google Cloud ML, etc.). As ML adoption grows, many enterprises are quickly moving to ready-to-use ML platforms to accelerate time to market, reduce operating costs, and improve success rates (the number of ML models deployed and commissioned).
Our culture is visual, and visualization is increasingly a key tool for understanding the Big Data generated every day. Data visualization helps tell stories by distilling the data into an easy-to-understand form, highlighting trends and outliers. Good visualization tells a story, removes noise from the data, and highlights the useful information and promising trends for the business.
The most popular tools here, as it seems to me, are Tableau and Looker, in addition to all the other technologies described above that are somehow necessary for building data warehouses, managing them, etc.
Tableau is a powerful BI and data visualization tool that connects to your data and allows you to perform detailed, comprehensive analysis, as well as draw charts and dashboards.
Looker is a cloud-based BI platform that allows you to query and analyze large amounts of data using SQL-defined metrics after configuring visualizations that tell a story about your data.
To solve all the problems described above, you need an infrastructure with the right architecture and the right management, monitoring, and provisioning environment around it. And that's not all I include in this section: there is also the orchestration of pipelines and the introduction of DevOps practices into various areas of data management.
The construction of microservices has long been a settled issue: one way or another, all serious solutions are built on a microservice architecture. And here come Docker containers, k8s, Helm, Terraform, Vault, Consul, and everything around them. All of this has become standard without anyone noticing.
Log management is the process of handling log events generated by software applications and the infrastructure they run on. It includes the collection, analysis, storage, and search of logs, with the ultimate goal of using the data for troubleshooting and obtaining business, application, and infrastructure insights.
One of the important tools here is the ELK stack, which consists of Elasticsearch (text search), Logstash and Beats (data shipping), and Kibana (data visualization). Together they provide a fully working tool for real-time data analysis; while they are designed to work together, each of them is a separate project. ELK provides online analysis functions such as report creation, alerts, and log search, which makes it a universal tool not only for DevOps but also for the areas mentioned above.
Another tool, Splunk is a machine data tool that offers users, administrators, and developers the ability to instantly receive and analyze all the data created by applications, network devices in IT infrastructure, and any other machine data. Splunk can receive machine data and turn it into real-time analytics by providing real-time information through charts, alerts, reports, and more.
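Underneath all these tools sits the same "parse, structure, aggregate" step, which can be sketched in a few lines of Python. The log format and field names here are made up for the example; real pipelines do this with Logstash grok patterns or Splunk's field extraction at massive scale.

```python
import re
from collections import Counter

# A made-up log format: timestamp, level, service, free-text message.
LOG_LINE = re.compile(r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<msg>.*)")

raw = """\
2021-03-01T10:00:00 INFO api request served
2021-03-01T10:00:01 ERROR worker job failed
2021-03-01T10:00:02 ERROR api timeout
2021-03-01T10:00:03 INFO api request served
"""

# Parse: raw lines -> structured records (dicts), skipping non-matching lines.
records = [m.groupdict() for m in map(LOG_LINE.match, raw.splitlines()) if m]

# Aggregate: count errors per service, the kind of query a Kibana panel runs.
errors_by_service = Counter(r["service"] for r in records if r["level"] == "ERROR")
print(dict(errors_by_service))  # {'worker': 1, 'api': 1}
```

Once logs are structured records rather than strings, alerting and dashboards become simple queries over fields, which is the whole point of a log management pipeline.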
Pipeline Orchestration — automated management of fault-tolerant workflows.
Most big data solutions consist of repetitive data processing operations encapsulated in workflows. Pipeline orchestration tools help automate these workflows: they can schedule jobs, execute workflows, and coordinate dependencies between tasks in a fault-tolerant way.
I used to hear about Oozie all the time; now it's mostly Airflow or AWS Step Functions.
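The core of what these orchestrators do (resolve dependencies, then run tasks in order) can be sketched with the standard library's `graphlib`. This is a toy, not Airflow's API: the DAG and task names are invented, and a real orchestrator adds scheduling, retries, and alerting on top.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on
dag = {
    "extract": set(),
    "clean":   {"extract"},
    "load":    {"clean"},
    "report":  {"load", "clean"},
}

def run(dag, tasks):
    # TopologicalSorter raises CycleError if the dependencies are circular.
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        tasks[name]()  # a real orchestrator would add retries and alerting here
    return order

executed = []
tasks = {name: (lambda n=name: executed.append(n)) for name in dag}
run(dag, tasks)
print(executed)  # ['extract', 'clean', 'load', 'report']
```

The guarantee is simply that no task starts before everything it depends on has finished: `report` always runs after both `load` and `clean`.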
In Big Data it is obvious that the future lies in the clouds, and anyone interested in data management had better understand cloud concepts.
In addition to programming patterns (API Gateway, Pub/Sub, Sidecars, etc.) applied at the cloud level, you will encounter concepts such as Infrastructure as Code and Serverless, and of course architectural concepts (N-Tier, Microservices, loose coupling, etc.). Personally, it gave me a deeper understanding of engineering approaches at a higher level and pumped up (a little) my architectural thinking.
There are clouds such as GCP, AWS, and Azure, and I think nobody will argue that these are the main options. Say you decide to choose AWS: all the clouds are designed in roughly the same way, although each has its own specifics, and not every service of one CSP maps to another.
Where do I start to learn AWS?
So, you go to the AWS Documentation and you see an endless list of services, and it's just the global table of contents of global tables of contents! That's right: Amazon is huge right now. At the time of writing, there are about two hundred and fifty services under the hood. It is not realistic to learn them all, and there is no reason to.
John Markoff said, "The Internet is entering its Lego era." AWS services are similar to Lego: you find the right pieces and combine them.
To highlight the most essential pieces, it is reasonable to take the ones that historically came first. They are:
- S3 — storage
- EC2 — virtual machines + EBS drives
- RDS — databases
- Route53 — DNS
- VPC — network
- ELB — load balancers
- CloudFront — CDN
- SQS/SNS — messages
- IAM — main access rights to everything
- CloudWatch — logs/metrics
Then there are modern serverless pieces (Lambda, DynamoDB, API Gateway, CloudFront, IAM, SNS, SQS, Step Functions, EventBridge).
The process of integrating and preparing data for migration from an on-prem solution to the cloud is complex and time-consuming. In addition to migrating massive volumes of existing data, companies have to keep their data sources and platforms in sync for the weeks or months it takes until migration is complete.
In addition to migration, enterprises are preparing disaster recovery plans to be ready for anything without sacrificing business and the obvious solution here is migration to the cloud as well.
Our ML algorithms are fine, but good results require a large team of data specialists, data engineers, domain experts, and support staff. As if the cost of experts were not constraint enough, our understanding is still primitive. Finally, moving models into production and keeping them up to date is the last hurdle, given that the results a model produces can often only be achieved with the same expensive and complex architecture used for training. It should be understood that moving to production is a process, not a step, and it starts long before model development: its first step is to define the business objective, the hypothesis about the value that can be extracted from the data, and the business ideas for applying it.
MLOps is a combination of technologies, ML processes, and approaches for embedding developed models in business processes. The concept emerged as an analogy to DevOps applied to ML models and approaches. DevOps is an approach to software development that increases the speed of shipping individual changes while maintaining flexibility and reliability through a number of practices: continuous development, splitting functionality into independent microservices, automated testing and deployment of individual changes, global performance monitoring, a system for promptly responding to detected failures, and so on.
MLOps, or DevOps for machine learning, allows data science and IT teams to collaborate and accelerate model development and implementation by monitoring, validating, and managing machine learning models.
Of course, there is nothing new here; everyone has been doing it one way or another for a while. Now there is just a hype word, behind which there are usually ready-made solutions like Seldon, Kubeflow, or MLflow.
Aside from that, there are many tools for getting an ML model to production; the most popular ones, besides those already mentioned, are, I think, TensorFlow Serving and Vowpal Wabbit. It is also worth looking at existing ML platforms that manage the ML application lifecycle from data to experimentation to production: Google TFX (TensorFlow Extended), Facebook FBLearner, Uber Michelangelo, AWS Sagemaker, Stanford DAWN, Airbnb's Bighead, etc.
It turned out to be a lot, and it still seems I haven't said anything. By no means do I consider myself an expert in everything; I'm only speculating about the technology here. But as you can see even from this article, many skills overlap across several areas of Big Data, and the list does not end there. With these skills, you need not fear being unable to find a job.
Do not chase trends — build skills that stay relevant. And the most relevant skills are probably soft skills.
The steep learning curve associated with many data engineering activities is becoming a cliff. Hand-coded projects require deep knowledge of many aspects of the organization and of a lot of tools and existing solutions.
In data management, we are still in the Wild West, especially in ML...