When people talk about Big Data, many remember the 3 V's of Big Data - Volume, Velocity, Variety (recently I've heard that a number of V's is now up to 42). Today we're going to talk about Velocity, or put simply, speed.
Velocity is quite a hot topic, especially nowadays when everyone wants to do a lot of things as fast as possible, from simple push notifications to complex analytics on the data stream.
In the streaming world, there is such a thing as an event. By event, we can mean any operation with data in any subsystem. It can be an OLTP database log, or it can be a clickstream. All of it is the stream of events(data stream) in the system.
So, in the past, when life was simpler, we handled batch queries on all of our data streams. The data was an hour or a minute behind, it didn't matter that much. Then it became insufficient — speed was no longer a nice to have but a requirement for companies to survive. Streaming came along. What is the difference between these two approaches?
Batching vs streaming
Batching is an aggregation operation on data, either all or part of it. This is not necessarily analytics, it can be a simple OLTP operation, like updating records in the database, but can be quite resource-intensive as Machine Learning.
Pros and cons of the approach:
➕ it is easy to recalculate data
➕ ability to implement sophisticated logic using powerful tools
➖ significant delays in getting the results
➖ the amount of analyzed data is limited by disk space because we're analyzing data that we already stored somewhere
-Who is that running so fast?
-That's Elusive Joe!
-Wait, that's because nobody can catch him?
-No, that's because nobody cares to do that.
Streaming is a slightly perpendicular approach where we are no longer talking about data as a limited entity but as an infinite stream of events. On this event stream, we are trying to do any kind of aggregation and we don't really need to store the data itself. You can think of it as a pipe in which we put kind of 'filters', and that pipe ends with a sewer, like /dev/null.
Pros and cons of the approach:
➕ aggregation on the fly, it's fast
➕ the volume of processed data is not limited by disk space
➖ difficult to implement complex logic
➖ it is very difficult to correct an error retroactively because the data is already gone
In summary, it is quite obvious that the advantages of one approach are the disadvantages of the other. And perhaps it is worth combining them in some way.
This is the idea behind the Lambda architecture a concept first proposed by Nathan Marz — to combine batching and streaming into one system.
To handle the numerous events occurring in the Lambda system, the architecture takes an event stream and forks/duplicates it into two relatively independent layers - both the Batch Layer and the Speed Layer.
The Batch layer handles the responsibility of managing historical data and recomputing tasks on that data (e.g. map-reduce tasks or machine learning model training). The Batch layer takes incoming data, combines it with historical data, and recomputes the results by iterating over the entire dataset.
At the beginning of the Batch Layer sits Data Storage. Here the original data remains unchanged, it is an archive of raw historical data. Most often it is fact-based model data in a Data Lake based on Hadoop, although it also occurs in the form of Data Warehouse based on Vertica for example. It's important to note that the data remains unchanged here - only new data is added. Even if you lose all your data on the service layer you can reconstruct them from here.
The Batch layer runs on the entire dataset and thus allows the system to give the most accurate results. However, the results are achieved at the expense of high latency due to the long computation time.
The result of this layer will be the so-called Batch Views. These are most often denormalized data views better suited for business logic.
Speed Layer is used to provide a low-latency, near-real-time result. Speed Layer performs incremental updates on data that was not processed in the last batch of the Batch Layer. This compensates for differences in data relevance and adds information with a short lifecycle (it will be deleted on the next batch run anyway) to individual real-time views.
Thanks to this layer decrease the cost of data processing - because we don't need to recalculate all the available data every time and just concentrate on the small portion which came recently. But because we do not do a full reprocessing of the data, the results are likely to be less accurate than on the Batch layer.
Serving Layer is the last component of Lambda Architecture. In this layer, the query aims at merging and analyzing data from both the Batch Layer view and the incremental flow view from the Speed Layer.
It is also often mentioned that this layer is tightly coupled to the Batch Layer. This is because the Batch Layer is constantly updating the Serving Layer views, and these views will always be out of date due to the high latency of the batch processing. But this is not a problem, because the Speed Layer will be responsible for any data that is not yet available in the Serving Layer.
Serving Layer indexes packets and processes the results of the computations taking place in the previous layers and denormalizes them if necessary. Because of the indexing and processing of incoming information, the results may lag in time, it depends on the implementation.
Almost any database can be used here, either resident (in memory) or persistent, including special purpose storages, e.g., for full-text search.
We ended up with a scalable system that is immune to occasional loss or corruption of the data. It also has the ability to re-process the data in case you need to update the business logic or fix a bug. This is possible because the underlying data set is immutable and it is easy to restore the system state based on it.
While the Lambda architecture provides many benefits, it also introduces the complexity associated with the need to align business logic across streaming and batch codebases, which can make debugging difficult. We need to write the same logic in two places with, most likely, different languages. Imagine how much fun it is to roll out new versions and hotfixes. It is not hard to write it on Batch Layer where we have all sophisticated tools available, but transferring it to streaming can be challenging — problems arise even at the level of language translation.
Why do we need a Batch Layer at all? Can we remove it and write everything in just one Speed layer?
If you go back and think about why we added Batch Layer, there are two advantages to it:
➕ it is easy to recalculate data
➕ ability to implement sophisticated logic using powerful tools
Consequently, if you can solve those problems on the Speed Layer — you don't need a Batch Layer at all.
The first problem, in fact, is the lack of Data Storage where our immutable raw data should be stored. If you think about it, the batch approach is a kind of subset of the streaming approach with a bigger time window. So, we just need to have Data Storage and have a set of steps to recalculate the data.
But here Data Storage in the same form as on the Batch Layer of Lambda architecture is not quite suitable. We need an immutable data log. This is a similar concept to the immutable raw dataset in Lambda architecture, but instead of using technologies such as Hadoop/HDFS, Kappa architecture's immutable data log is usually based on Kafka. Basically, we do this because we want to have a sequence of events with timestamps in addition to storing the actual data. You can mirror the Kafka topic on HDFS, so you are not limited to the Kafka configuration.
The output of the pipelines can store everything in a database, so it's nice and easy to get the data and make views out of it.
When you want to do processing, run the second instance of your streaming job that starts processing from the beginning of the stored data, but direct that output to a new table. When the second job catches up, switch the application to read from the new table. Stop the old version of the job and delete the old result table. You can test if the new version works fine, and if not, go back to the old table.
The second problem, unfortunately, is more of a statement of fact — there are not many stable tools on which you can build every streaming solution. The most popular now is probably Apache Samza and Apache Flink.
After all, we can solve these problems and get rid of the Batch Layer for good. Apache Kafka co-creator Jay Kreps did this by proposing the Kappa architecture. The key idea is to process data with a single processing engine. Sweet! The team can develop, test, debug, maintain, and deploy their systems on a single processing framework. As a consequence, the Kappa architecture is composed of only two layers — Speed Layer and Serving Layer.
These are essentially the same layers as the Lambda architecture, with minor additions that solve the problems described above.
While there is a lot of literature describing how to build a Kappa architecture, there are few use cases that describe how to successfully implement it in production.
If you're a fairly large company, your data platform is likely to be used by more than one team. Some teams use data for analytics, which prioritizes fast computation. Other teams value correctness and consistency over a much longer period of time for monthly business intelligence reports. In addition, there will always be some events that are severely delayed for whatever reason or came in the wrong order and won't be captured in the Speed Layer.
Lambda architecture doesn't have these problems because it has a Batch Layer. These problems can be solved with the help of watermarking and backfilling analog with the same Speed Layer but it's rather challenging. Here is Uber describing its approach.
We ended up with a similar architecture to Lambda, with no problem maintaining two codebases - there is only one code to maintain a unique framework. Another advantage of this architecture is that we don't have to process the same data twice(on batch and speed layers), but we only re-process data when we change the code. And that means fewer resources are required due to one way of processing the data.
But on top of that, we have added a couple of very unpleasant problems. The lack of a Batch Layer can lead to errors when processing the information or when updating the database. So in the Kappa architecture, there is a need for an exception dispatcher for data re-processing or validation. Also from the disadvantages, I see that not many good streaming frameworks can be used in this architecture now as well as not all algorithms can be made streaming. In addition, there are a number of problems that will definitely arise in the operation of such a system, which will need to be solved.
So when should we use one architecture or the other? As is often the case, it depends on some peculiarities of the implemented application.
A very simple case is when the algorithms used for real-time and historical data are identical and implementable on streaming. It is then clearly very advantageous to use the same codebase to process historical and real-time data, and hence use the Kappa Architecture.
If the algorithms used to process historical data and real-time data are not always identical. Here, the choice between Lambda and Kappa becomes a tradeoff between the performance benefits of batch processing over a simpler codebase.
Finally, there are even more complex use cases in which even the outputs of the streaming and batch algorithms differ. For example, a machine learning application requires so much time and resources that the best result achievable in real-time is to compute approximate updates to the model. In such cases, the batch and streaming layers cannot be combined, and it is necessary to use Lambda architecture.
I haven't had personal experience with Kappa architecture yet, and everything I know about it is based on other people's experiences. So I'm a bit biased, but it seems to me that Lambda architecture claims to be universal, which it actually justifies: despite its significant limitations, it solves the very dilemma "speed vs reliability" with the least losses. Lambda Architectures are ubiquitous in machine learning and data science applications.
- Big Data: Principles and Best Practices of Scalable Realtime Data Systems by James Warren, Nathan Marz
- Questioning the Lambda Architecture
- Foundations for Architecting Data Solutions
- Streaming Systems by Tyler Akidau, Slava Chernyak, Reuven Lax