Data Challenges in Big Data

#data-engineering

My friend asked me an interesting question about which skills are worth learning for Data Management specialists and how to build a growth roadmap.

In fact, the question made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm speculating about the current state and the future of Data Management.

This layer of knowledge in one form or another refers to the problems that have to be solved when working with data. Where you have Big data you have Big problems.

When you're working on one or another layer of data processing, you will need some particular set of skills. Let's dive into them.

The majority of Big Data pipelines include the layers covered below: Data Storage, Data Ingestion, and Data Processing.

Data Storage

With the increasing amount of stored information, the problem of Data Storage has historically been the first. This is the foundation of any system that works with data — there are many technologies that store massive amounts of raw data coming from traditional sources such as OLTP databases, and newer, less structured sources such as log files, sensors, web analytics, document archives, media archives and so on. As you can see, these are very different areas with their own specifics, and we need to collect data from all of them to get overall information about the whole system.

The first questions that pop up are which format to use for storing data and how to structure it optimally. Here, of course, you can think of the Parquet, CSV, and Avro formats, which are very common in the Big Data world. Compression codecs such as Bzip2, Snappy, and LZO can also be considered. Beyond that, optimizations basically come down to either proper partitioning or storage-specific tuning.
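To make the partitioning idea a bit more concrete, here is a minimal sketch that uses only the Python standard library. It writes records as gzip-compressed CSV files into Hive-style `key=value` directories. The helper name `write_partitioned`, the record fields, and the file name `part-0.csv.gz` are my own assumptions for illustration, not the API of any particular tool.

```python
import csv
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root, partition_keys):
    """Write dict records as gzip-compressed CSV, one directory per partition.

    Hypothetical helper: the root/key=value/.../part-0.csv.gz layout mimics
    the Hive-style convention; real engines keep additional metadata.
    """
    groups = defaultdict(list)
    for rec in records:
        # Records with the same partition values land in the same directory.
        groups[tuple(rec[k] for k in partition_keys)].append(rec)

    written = []
    for values, rows in sorted(groups.items()):
        part_dir = Path(root).joinpath(
            *(f"{k}={v}" for k, v in zip(partition_keys, values))
        )
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv.gz"
        # Partition columns live in the path, so they are dropped from the file.
        fields = [f for f in rows[0] if f not in partition_keys]
        with gzip.open(path, "wt", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written


events = [
    {"year": 2023, "month": 12, "user": "a", "amount": 10},
    {"year": 2024, "month": 1, "user": "b", "amount": 5},
    {"year": 2024, "month": 1, "user": "c", "amount": 7},
]
root = tempfile.mkdtemp()
paths = write_partitioned(events, root, ["year", "month"])
relative = [p.relative_to(root).as_posix() for p in paths]
```

A query that filters on `year` and `month` only has to open the matching directories, which is the main reason partitioning pays off.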

One of the main technologies this layer is built on is Hadoop with HDFS, a classic large-scale distributed file system. It became popular due to its durability and its ability to scale out on commodity hardware. However, nowadays more and more data is stored in the cloud, or at least in hybrid solutions: organizations are moving from outdated on-premise storage systems to managed services such as AWS S3, Google Cloud Storage, or Azure Blob Storage.

For SQL solutions, among the popular projects are Hive, Apache Drill, Apache Impala, Apache Spark SQL, and Presto. Also, there are more interesting Data Warehouse solutions, which I think lie above simple SQL engines. We will talk about them a little bit later.

For NoSQL solutions, the usual choices are Cassandra, which traditionally offers tunable consistency rather than full ACID transactions, MongoDB for the document data model and manageable data sizes, or AWS DynamoDB as a scalable option if you are in the AWS cloud.

For a graph database, I can recall only Neo4j. It is very well suited for storing graph data or related information such as a group of people and their relationships. Data modeling of this type of information in a traditional SQL database is a pain and very inefficient.
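To illustrate why this kind of data fits a graph model, here is a small sketch in plain Python. It is not Neo4j code. The `friends` data and the `within_hops` helper are made up for the example: relationships are stored as adjacency lists and a breadth-first search finds everyone within a given number of hops.

```python
from collections import deque

# Hypothetical social graph stored as adjacency lists: person -> direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}


def within_hops(graph, start, max_hops):
    """Return {person: distance} for everyone within max_hops of start (BFS)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]  # the starting person is not part of the answer
    return distance


friends_of_friends = within_hops(friends, "alice", 2)
```

In a relational schema, each extra hop typically means another self-join on a friendships table (or a recursive query), which is where the pain comes from.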

Data Lake

A Data Lake is the company's centralized repository that allows for storing all structured and unstructured data about the business. Here we store data as is, without structuring it, and run different types of analytics on top.

Nowadays, digital transformation is really about applying a data-driven approach to every aspect of the business in an effort to create a competitive advantage. That's why more and more companies want to build their own data lake solutions. This trend continues, and those skills are still in demand.

Vendor selection for the Hadoop distribution is most often driven by the client, depending on their personal bias, the vendor's market share, or existing partnerships. The vendors of Hadoop distributions for on-prem clusters are Cloudera, Hortonworks, MapR, and IBM BigInsights. On-prem is considered more secure, and banks, insurance companies, and medical institutions like it a lot because the data does not leave their premises. However, acquiring and maintaining the infrastructure costs much more in terms of time and effort.

There are also cloud storage solutions from AWS, GCP, and Azure. Compared to on-prem, cloud solutions provide more flexibility in terms of scalability and ready-to-use resources, but their ongoing usage costs can be high.

Aside from that, there are also data platforms that try to fill several niches and create integrated solutions, for example, Apache Hudi and Delta Lake.

Data Warehouse

A Data Warehouse can be described as an ordered repository of data that is used for analysis and reporting and is optimized for aggregation queries. It is the same kind of foundation for analytics and data-driven decisions as a Data Lake; the two do not exclude each other but rather complement each other.

Data Marts are one of the last layers of Data Warehouse solutions designed to meet the requirements of a specific business function. Their ability to pull data from disparate sources and make it available to business users makes them a growing trend in the field of data warehousing.

Trending data warehouse solutions include Teradata, Snowflake, BigQuery, and AWS Redshift.

Data Hub

There are Data Warehouses, where the information is sorted, ordered, and presented in the form of final conclusions (the rest is discarded), and Data Lakes, where the approach is "dump everything here, because you never know what will be useful". The Data Hub is aimed at those who do not fit either the first or the second category.

The Data Hub architecture allows you to leave your data where it is, providing centralization of the processing but not the storage. The data is searched and accessed right where it is located at the moment. But, because the Data Hub is planned and managed, organizations must invest significant time and energy in determining what their data means, where it comes from, and what transformations it must complete before it can be put into the Data Hub.

The Data Hub is a different way of thinking about storage architecture. And I bet it will gain some attention in the future — all of the enabling pieces are available today.

Data Ingestion

To create data storage, you need to ingest data from various sources into the data layer, whether it is a Data Lake, a Data Warehouse, or just HDFS. Data sources can be CRM systems like Salesforce, enterprise resource planning systems like SAP, RDBMSs like PostgreSQL, or any log files, documents, social network graphs, etc. Data can be uploaded either through batch jobs or through real-time streams.

There are many tools for ingestion; one of the common ones is Sqoop. It provides an extensible Java-based framework for developing drivers that import data into Hadoop. Sqoop runs on the MapReduce framework in Hadoop and can also be used to export data from Hadoop to an RDBMS.

Another common tool is Flume. It is used when data arrives faster than it can be consumed. Typically, Flume is used to ingest data streams into HDFS or Kafka, where it can act as a Kafka producer. Multiple Flume agents can also be used to collect data from multiple sources into a Flume collector.

Another tool gaining popularity is NiFi. NiFi processors are file-oriented and schema-less. This means that a piece of data is represented as a FlowFile (it can be an actual file on disk or some block of data obtained elsewhere). Each processor is responsible for understanding the content of the data it works with. So if one processor understands format A and the other only understands format B, you may have to convert the data between the two processors.

One of the de-facto standards in the message bus world is Kafka — an open-source streaming messaging bus that can create a feed from your data sources, partition the data, and stream it to consumers. Apache Kafka is a mature and powerful solution used in production on a huge scale.
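The "partition the data" part can be sketched in a few lines of Python. This is a toy model, not Kafka client code. The functions `choose_partition` and `route` are hypothetical, and CRC32 is only a stand-in hash (Kafka's default partitioner uses a different hash function). The point is that messages with the same key always land in the same partition, so their relative order is preserved for consumers.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    Assumption: CRC32 is used only for illustration; it is not the hash
    that Kafka's default partitioner applies.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append each (key, value) message to the log of its partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
logs = route(messages, num_partitions=3)
user1_partition = choose_partition("user-1", 3)
user1_events = [value for key, value in logs[user1_partition] if key == "user-1"]
```

Here `user1_events` keeps the original order of the "user-1" messages, while different keys may be spread across partitions and consumed in parallel.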

Data Processing

Thanks to the data ingestion pipelines, the data is fed into the data layer. Now you need technologies that can process large amounts of data to facilitate analysis and crunch this data. Data analysts and engineers want to run queries against Big Data, which requires huge computing power. The data processing layer must optimize the data to facilitate efficient analysis and provide a computational engine to execute the queries.

Compute clusters are better suited to meet the high computational needs of Big Data pipelines. Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on worker nodes. This can be handled by software like Hadoop’s YARN, Apache Mesos, or Kubernetes.

The most popular pattern on this layer is ETL (Extract, Transform, Load). Essentially, we extract data from one or more sources, clean it up, and convert it into structured information that we load into a target database, data warehouse, or data lake.
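As a rough illustration of the three steps, here is a tiny ETL sketch using only the Python standard library. The CSV content, the `orders` table, and the `extract`, `transform`, and `load` function names are all made up for the example, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent formatting and one bad row.
RAW_CSV = """order_id,customer,amount
1, Alice ,10.50
2,bob,not-a-number
3,Carol,7
"""


def extract(text):
    """Extract: read raw rows from the source (a CSV string here)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, parse amounts, drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would route this row to a rejects table
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

The transformation happens before the data reaches the target, so only cleaned rows are ever stored there.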

And one of the tools that successfully implements this pattern is Apache Spark. This is one of the most important big data multi-tools which should already be in the hands of anyone who is dealing with large amounts of data. It performs parallelized queries and transformations for structured or unstructured data on large clusters. Spark also provides an SQL interface and has good streaming and built-in ML capabilities.

ETL to ELT

Currently, there is a movement from ETL to ELT, where the transformations take place inside the data warehouse rather than upfront. As it seems to me, this comes from a lack of knowledge about the data: traditionally, a lot of planning and rigor went into deciding what had to go into the data warehouse to make it stable and accessible for users. Then come changes in the input data format, the output structure, and so on.

Tools such as Snowflake and AWS Redshift allow you to create an abstraction layer over the loaded data (even unstructured) to provide a simple SQL API over it and forget about the letter T. Another tool that supports all the SQL-related workflows is dbt.
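A minimal sketch of the ELT idea, assuming an in-memory SQLite database as a stand-in for a real warehouse: the raw rows are loaded untouched as text, and the transformation is expressed afterwards as SQL inside the database. The table `raw_orders` and the view `customer_totals` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the source rows as they are, everything stored as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", " Alice ", "10.50"), ("2", "bob", "3"), ("3", "alice", "7")],
)

# Transform: defined later, inside the database, as a plain SQL view.
conn.execute(
    """
    CREATE VIEW customer_totals AS
    SELECT lower(trim(customer)) AS customer,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_orders
    GROUP BY lower(trim(customer))
    """
)

totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
```

Because the raw table is kept, the view can be changed or replaced later without re-ingesting anything. A SELECT like the one in the view is roughly the kind of thing a dbt model contains.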

Batch to real-time

It is now clear that real-time data collection systems are rapidly replacing batch ETLs, making streaming data a reality. More and more, both the ingestion and processing layers are moving to real-time, which in turn pushes us to learn new concepts and to use multi-tools such as Spark and Flink that can do both batch and real-time processing.
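One of those new concepts is windowing. The sketch below is plain Python, not Spark or Flink code, and the function names are hypothetical. It counts events per fixed (tumbling) time window in two ways: a batch version that sees all the data at once, and a streaming version that emits each window as soon as it closes.

```python
from collections import defaultdict


def batch_counts(events, window_seconds):
    """Batch style: all events are available up front; aggregate in one pass."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        counts[timestamp - timestamp % window_seconds] += 1
    return dict(counts)


def streaming_counts(event_stream, window_seconds):
    """Streaming style: yield (window_start, count) when a window closes.

    Assumes events arrive in timestamp order. Real engines deal with late
    and out-of-order data (for example with watermarks); this sketch does not.
    """
    current_window, count = None, 0
    for timestamp, _payload in event_stream:
        window = timestamp - timestamp % window_seconds
        if current_window is not None and window != current_window:
            yield current_window, count
            count = 0
        current_window = window
        count += 1
    if current_window is not None:
        yield current_window, count  # flush the last open window


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
batch_result = batch_counts(events, 10)
stream_result = list(streaming_counts(iter(events), 10))
```

Both versions produce the same counts; the difference is that the streaming one can report the first window without waiting for the rest of the data.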

In-Memory

Because memory is becoming cheaper and enterprises rely on real-time results, in-memory computing enables them to have richer, more interactive dashboards that deliver the latest data and are ready to report almost instantly. By analyzing data in memory rather than on disk, they can get an instant view of the data and act on it quickly.

For the most part, all known solutions already use or try to use this approach. Here again, the most understandable examples are Spark and data grid implementations such as Apache Ignite.

Apache Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. I actually don't know any other such format.
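To show what "columnar" means in practice, here is a small sketch that contrasts a row layout with a column layout. It deliberately does not use the Arrow API: the standard library `array` module stands in for typed, contiguous column buffers, and the sample records are made up.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"user": "a", "amount": 10.0, "clicks": 3},
    {"user": "b", "amount": 5.5, "clicks": 1},
    {"user": "c", "amount": 7.0, "clicks": 8},
]

# Column-oriented: each field is stored as one contiguous, typed buffer.
columns = {
    "user": [r["user"] for r in rows],
    "amount": array("d", (r["amount"] for r in rows)),  # 64-bit floats
    "clicks": array("q", (r["clicks"] for r in rows)),  # 64-bit signed ints
}

# An aggregate over one field touches every record in the row layout...
total_from_rows = sum(r["amount"] for r in rows)
# ...but only a single buffer in the columnar layout.
total_from_columns = sum(columns["amount"])
```

Analytical queries usually read a few columns across many rows, which is why the second layout tends to be friendlier to caches and vectorized execution.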

Additional materials

Hadoop: The Definitive Guide. I've read this book several times, and I 100% recommend it to every engineer who wants to work with big data, even if it's a little dated today
Spark: The Definitive Guide
Learning Spark: Lightning-Fast Data Analytics
Check out the talk Big Data Architecture Patterns