My friend asked me an interesting question: what skills are worth learning for Data Management specialists, and how do you build a growth roadmap?
The question made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm speculating about the current state and future of Data Management.
- Data Engineering skills
- Data Challenges in Big Data
- Management Challenges in Big Data
- Analytical Challenges in Big Data
- Operational Challenges in Big Data
To solve all the challenges described in the other posts, you need infrastructure with the right architecture and the right environment for managing, monitoring, and provisioning it. And that's not all I include in this section: it also covers pipeline orchestration and the introduction of DevOps practices across various areas of data management.
Microservices have long been a settled question: one way or another, all serious solutions are built on a microservice architecture. And with it come Docker containers, Kubernetes, Helm, Terraform, Vault, Consul, and everything around them. All of this became a standard almost without anyone noticing.
Real-time data is often used to visualize application and server metrics. Data changes frequently, and large deltas in metrics tend to indicate a significant impact on the health of systems or an organization. In these cases, projects such as Prometheus can be useful for processing data streams and time-series data visualization.
Log management is the process of handling log events generated by different software applications and the infrastructure they run on. It includes the collection, analysis, storage, and search of logs, with the ultimate goal of using that data for troubleshooting and for gaining business, application, and infrastructure insights.
One of the important tools here is the ELK stack, which consists of the following components: Elasticsearch (a search and analytics engine), Logstash and Beats (data processing and shipping tools), and Kibana (a data visualization tool). Together they provide a complete toolkit for real-time data analysis. While they're all designed to work together, each is a separate project. ELK provides online analysis functions such as report creation, alerts, log search, etc., which makes it a universal tool not only for DevOps but also for the areas mentioned above.
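To make the ingestion side concrete, here is a minimal sketch (not tied to any specific shipper) of emitting logs as JSON lines, the shape that Beats/Logstash-style pipelines typically forward into Elasticsearch. The logger name and field set are my own choices, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, so a shipper
    can forward it into a search backend without extra parsing."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")  # hypothetical application logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("user %s logged in", "alice")
```

Structured logs like this are what make the "search" part of log management cheap: every field becomes queryable instead of being buried in free text.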
Another tool, Splunk, is a machine data platform that offers users, administrators, and developers the ability to instantly ingest and analyze all the data created by applications, network devices in the IT infrastructure, and any other machine data. Splunk can take in machine data and turn it into real-time analytics, delivering insights through charts, alerts, reports, and more.
Most big data solutions consist of repeated data processing operations encapsulated in workflows. Pipeline orchestration tools help automate these workflows: they can schedule jobs, execute workflows, and coordinate dependencies between tasks in a fault-tolerant way.
I used to hear about Oozie; now it's mostly Airflow, Dagster, Prefect, or AWS Step Functions.
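The core idea all of these tools share — tasks form a DAG, and upstream tasks must finish before their dependents run — can be sketched with nothing but the standard library. The task names here are made up; a real orchestrator adds scheduling, retries, and state on top of this ordering logic:

```python
from graphlib import TopologicalSorter

# A toy pipeline: task name -> set of upstream tasks it depends on.
# Orchestrators like Airflow or Dagster express the same structure
# as a DAG of operators; this sketch shows only the ordering.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}

# static_order() yields tasks so that every task appears
# after all of its upstream dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

In a real deployment the runner would also mark task state (queued, running, failed) and re-run failed subgraphs, which is exactly the fault tolerance these tools sell.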
In Big Data, the future obviously lies in the cloud, and anyone interested in data management is better off understanding its concepts.
In addition to programming patterns (Gateway API, Pub/Sub, Sidecars, etc.) applied at the cloud level, you will encounter concepts such as Infrastructure as Code and Serverless, and of course architectural concepts (N-Tier, Microservices, loose coupling, etc.). Personally, it gave me a deeper understanding of engineering approaches at a higher level and pumped up (a little) my architectural thinking.
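Take Pub/Sub as one example of the loose coupling these patterns aim for. A minimal in-process sketch (class and topic names are mine, not any cloud SDK's API) shows the essential decoupling: publishers don't know who, if anyone, is listening:

```python
from collections import defaultdict

class Broker:
    """A toy in-process Pub/Sub broker. Cloud services such as
    GCP Pub/Sub or AWS SNS apply the same idea across networks,
    adding durability, delivery guarantees, and fan-out at scale."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber of the topic; publishers
        # never reference subscribers directly.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)
broker.publish("orders", {"id": 1, "amount": 42})
print(received)  # [{'id': 1, 'amount': 42}]
```

Swapping this toy broker for a managed one changes the transport, not the shape of the code — which is why the pattern matters more than any particular service.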
The clouds in question are GCP, AWS, and Azure; I don't think anyone will argue that these are the main options. Say you decide to choose AWS: all the clouds are designed in broadly the same way, although each has its specifics, and not every CSP's services map one-to-one onto another's.
The process of integrating and preparing data migration from an on-prem solution to the cloud is complex and time-consuming. In addition to migrating massive volumes of existing data, companies have to keep their data sources and platforms synchronized for the weeks or months until the migration is complete.
Beyond migration, enterprises prepare disaster recovery plans so they are ready for anything without sacrificing the business, and the obvious solution here is, again, the cloud.
Our ML algorithms are fine, but good results require a large team of data specialists, data engineers, domain experts, and support staff. And as if the cost of experts were not constraining enough, our understanding is still primitive. Finally, moving models into production and keeping them up to date is the last hurdle, given that the results a model produces can often be achieved only with the same expensive and complex architecture used for training. It should be understood that moving to production is a process, not a step, and it starts long before model development. Its first step is defining the business objective, the hypothesis about the value that can be extracted from the data, and the business ideas for applying it.
MLOps is a combination of technologies, ML processes, and approaches for embedding developed models into business processes. The concept emerged as an analogue of DevOps applied to ML models and approaches. Typically, an MLOps system includes platforms for collecting and aggregating data, analyzing and preparing it for ML modeling, tools for performing computation and analytics, and tools for automated transfer of ML models, data, and derived software products between different lifecycle processes. Such unified pipelines partially or fully automate the work of a Data Scientist, Data Engineer, ML engineer, or Big Data developer.
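The lifecycle those pipelines automate can be caricatured in a few lines. Everything below is deliberately toy — the data, the "model" (least squares through the origin), and the stage names are all made up — but the shape, ingest → prepare → train → deploy as composable stages, is the thing MLOps tools industrialize:

```python
def ingest():
    # pretend these rows were collected from upstream sources
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def prepare(rows):
    # split into features and targets; real pipelines also
    # validate, clean, and version the data here
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return xs, ys

def train(xs, ys):
    # toy "model": fit y ~ a*x by least squares through the origin
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return {"slope": a}

def deploy(model):
    # a deployed model is just a callable artifact in this sketch
    return lambda x: model["slope"] * x

# the pipeline chains the lifecycle stages end to end
model = deploy(train(*prepare(ingest())))
print(model(4.0))  # prediction for x = 4
```

An MLOps platform replaces each of these functions with a tracked, reproducible, monitored step — and keeps the artifacts and data versions that connect them.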
I consider the following to be the most popular MLOps tools:
- AWS SageMaker — a cloud-based machine learning platform that lets developers create, train, and deploy ML models in the AWS cloud;
- Kubeflow — Google's free, open-source ML platform for running machine learning pipelines in a Kubernetes container environment;
- MLflow — an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central registry of ML models;
- Sacred — a tool for automating ML experiments, from tracking parameters to saving configurations and reproducing results;
- DVC — a Git-like open-source version control system for ML projects, for local use.
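The DVC item above rests on one simple idea — content-addressed storage: hash the data file, keep the bytes in a cache keyed by that hash, and commit only the small hash to Git. A toy sketch (the function name and cache layout are my own, not DVC's actual API or format):

```python
import hashlib
import pathlib
import tempfile

def cache_file(path, cache_dir):
    """Content-address a data file the way DVC-style tools do:
    hash the bytes, copy them into a cache keyed by the hash,
    and return the hash to track in Git instead of the data."""
    data = pathlib.Path(path).read_bytes()
    digest = hashlib.md5(data).hexdigest()
    cached = pathlib.Path(cache_dir) / digest
    cached.write_bytes(data)
    return digest

# usage sketch: identical content always yields the same address
with tempfile.TemporaryDirectory() as tmp:
    dataset = pathlib.Path(tmp) / "data.csv"
    dataset.write_text("a,b\n1,2\n")
    print(cache_file(dataset, tmp))
```

Because the address is derived from the content, re-adding unchanged data is free, and any collaborator who has the hash can verify they got exactly the same dataset.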
Aside from that, there are many tools for taking an ML model to production; beyond what's already been mentioned, I think the most popular are TensorFlow Serving and Vowpal Wabbit. It is also worth looking at existing ML platforms for managing the ML application lifecycle from data through experimentation to production: Google TFX (TensorFlow Extended), Facebook FBLearner, Uber Michelangelo, AWS SageMaker, Stanford DAWN, Airbnb's Bighead, etc.
- Kubernetes patterns by Bilgin Ibryam and Roland Huß
- Matei Zaharia on Machine Learning at Industrial Scale
- What I learned from looking at 200 machine learning tools
- MLOps principles
Check out my book on a roadmap to the data engineering field.