This post serves as an introduction to the Change Data Capture (CDC) practice, rather than a deep-dive on a particular tool.
Imagine we are building a simple web application. In most cases, such projects start with minimal data architecture. For example, a relational database like MySQL or PostgreSQL is more than enough to process and store data that a number of users can work with. They enter queries, update them, close them, correct them, in general, perform a number of actions. It can be a CRM, ERP, automated banking system, billing system, or even POS-terminals, you name it.
However, the information stored in the database can be of interest to many third-party systems, typically analytical systems. The business needs to understand the status of applications or other entities stored in that system — accounts, deposits, manufacturing, human resources, etc. Data plays an important role in nearly every business operation. Therefore, the business generates reports on a periodic basis, which reflects all the main indicators of interest to the business and necessary to make further management decisions.
Reporting and analytics computations are usually very resource-consuming. Queries can take hours to complete, often severely affecting the performance of the system from which the data is retrieved. Another drawback is that sending all this data puts a lot of strain on the network. And finally, the business decisions based on that data are delayed because of the frequency of querying. So, if you update the data every night, that means you won't know what happened yesterday until the next day.
If the systems have a clear period of reduced load (e.g. at night) and this period is enough to unload all the necessary data without affecting the main activity of the system — then using direct queries to the RDBMS is probably an acceptable option. But what if there is no period of reduced load or if the allocated load window is not enough to completely unload the entire changed data?
And here CDC process comes to the rescue. As the name suggests, Change Data Capture will only capture the change in the data, and this is one of the ETL patterns to copy data. It is a mechanism that determines what data we are interested in, i.e. track changes in the source database and apply them to the target database or data warehouse. And in the target database or data warehouse, we can do all types of analysis, report generation, etc. And it doesn’t even impact the source database performance.
As a result, users can work with the original system without any performance degradation, and the management can get the reports they need to make management decisions at any time.
Thus the essence of CDC — provides historical change information for a user table by capturing both the fact that Data Manipulation Language (DML) changes (insert/update/delete) were made and the changed data itself. CDC extracting them in a form in which they can be reproduced in upstream data systems. In the jargon, such data is also called "delta".
You can think of CDC as a mechanism that constantly monitors the original data system for changes, extracts them, and distributes them to upstream systems. Change Data Capture excludes the process of bulk data loading by implementing incremental loading of data in nearly real-time.
So, how does using CDC solve the problems we have mentioned?
Well, first, because you're not running requests with a really high load on a periodic basis, you don't get really spikey load behavior. Secondly, instead of sending all this data and getting big spikes, you have to build large networks to make sure that all the data you want is sent in a timely manner, because the data is sent continuously and in much, much, much smaller batches, you don't have to build as many networks to make it work, and you can save significantly on network costs. And finally, because you're continuously sending data to a data warehouse, the data in the warehouse is up-to-date, which gives you real-time information for making business decisions based on more up-to-date data.
So Change Data Capture is really the optimal way to move data for a wide variety of use cases. It may be moving data into a data warehouse, the data lake, you may be creating an operational data store or a replica of that data in real-time.
Extracting the delta
Extraction of "delta" in transactional systems is a vital task when creating analytical data warehouses, CRM, MDM hubs, disaster recovery architecture, when there is a period of parallel operation, it is also a frequent task in projects for migration from one data system to another.
In its early days, the delta extraction problem was solved with a new timestamp column. Now we know which rows have been updated after a certain time but to do that, we have to read the whole table. And if the table changes frequently, it becomes an expensive and time-consuming process, and there is a non-zero chance that you might lose some of the data.
In order to avoid losing any data, engineers also tried versioning the rows, got about the same result — it works, but it's very resource-intensive. Moreover, in order to implement these methods, it is necessary to change the application logic, but sometimes we can't do that.
All these problems were solved with the appearance of triggers.
A trigger is a special type of stored procedure in the database. The peculiarity of triggers is that the SQL code written in the body of the trigger will be executed every time after some event occurs in the database.
So we create a trigger for a table, which will keep track of all changes and write them to some intermediate table, which will store the necessary changes.
But there is also a pitfall — now any change in the data in this table invokes the trigger. It requires processing power, besides the number of write operations is doubled at best. The complexity of maintaining such a database also increases. Despite the disadvantages, some CDC products are still based on triggers.
Modern approach to CDC
We can recall that databases are in fact transactional, and have a feature such as a database log also known as transaction journal or transaction log. Almost all Database Management Systems have a transaction log file that records all changes and modifications in the database made by each transaction. All we need to do is to access the transaction log file and select from it the changes we want to track.
So in CDC changes are captured by using a capture process that reads changes from the transaction log and places them in corresponding change tables.
The tricky part here is to read that transaction log correctly and track down the changes that we're really interested in. In the past, most products used to have an agent that would set up at the source of the changes, read the transaction log and transfer those changes to a file or directly apply them to the target.
Modern systems read the transaction log remotely, processing it in memory on a separate server, highlighting the necessary changes in it and applying those changes to the target. This architecture does not load the source system and allows for impressive performance that provides the ability to track changes that occur to the data and apply those changes to the target in real-time.
Production-ready CDC system
To build a production-ready CDC system, in addition to extract delta, we must think about several problems before we design such a system:
- Changes must be delivered in the order in which they occurred, otherwise inconsistent states may arise in the downstream systems;
- When it comes to message delivery, delivery guarantees are a must, so the CDC system must support message delivery guarantees at least once and exactly once if the downstream system misses a change event, it can make the state of the whole system inconsistent;
- Finally, there must be support for easy message conversion, because the data formats expected by the downstream systems may be different.
All of those requirements lead to pub/sub architecture with a message bus — any time a change is made to the data in the source system, the source pushes the change to the message bus. The target system is listening to the message bus and can then consume the changes as they arrive.
This solution provides a number of benefits, the main one being scalability. The pub/sub approach allows the source to send as many updates as it likes to the message bus and the target system can scale the number of consumers of this queue accordingly to process the data quicker if necessary.
The second benefit is that the two systems are now decoupled. If the source system wants to change its underlying database or move the particular dataset elsewhere, the target doesn’t need to change as it would with a pull system. As long as the source system keeps pushing messages to the message bus in the same format, the target can continue receiving updates blissfully unaware that the source system has changed anything.