Any data management system follows one of two approaches.
The first one is schema-on-write.
Many of you have probably worked with relational databases, so you know that the first step is to create tables and define their schemas. Once the schemas are configured and the tables created, we can begin to ingest the data. Typically this is a bulk upload from a text file whose structure we know in advance, because it matches the schema of the tables. And once the data is loaded, we can run analytical queries against it. This is schema-on-write: the approach in which we define the columns, data formats, relationships between tables, and so on before the actual data upload.
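The flow above can be sketched with Python's stdlib sqlite3 module; the table and column names are invented for illustration:

```python
# Schema-on-write: the table and its column types are declared
# before any data is loaded.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
""")

# Bulk-load rows whose structure we already know matches the schema.
rows = [(1, "alice", 20.0), (2, "bob", 5.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Only now can we run analytical queries.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 25.0
```

Any row that violates the declared types or constraints is rejected at write time, which is exactly where this approach pays its penalty.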
But there is an unfortunate catch: we can't upload data until the table is created, and we can't create the table until we understand the schema of the data it will hold. That, in turn, requires understanding the entities this data represents, so that their relationships are correctly reflected in the tables. Changes to the data cause problems as well. Suppose a source text file changes, or someone adds data or changes a column's type. Then we have to drop the table and reload all the data. That is tolerable for a small amount of data with no foreign keys. But what if there are foreign keys and, say, 500 GB of data? With this approach, such a simple change could take days.
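As a small illustration of what even a "simple" column-type change costs: SQLite, like several other engines, cannot change a column's type in place, so the usual recipe is create-copy-drop-rename. The table below is invented; imagine every step running over hundreds of gigabytes with foreign keys attached.

```python
# Changing a column's type under schema-on-write means rebuilding the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 10), (2, 20)])

# Change payload from INTEGER to TEXT: create a new table, copy every row,
# drop the old table, and rename.
conn.executescript("""
    CREATE TABLE events_new (id INTEGER, payload TEXT);
    INSERT INTO events_new SELECT id, CAST(payload AS TEXT) FROM events;
    DROP TABLE events;
    ALTER TABLE events_new RENAME TO events;
""")

payloads = [row[0] for row in conn.execute("SELECT payload FROM events")]
print(payloads)  # ['10', '20']
```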
So why do we need this approach?
Once the ETL process has loaded our data, subsequent reads will be fast and deterministic: everyone receives the same strictly defined data that conforms to the original schema. But remember that we already paid the penalty for this when the data was first loaded. Another downside of a strictly defined schema is that the data has been modified and structured for a specific, limited purpose and cannot be reused for future needs we do not know about yet.
With the rise of Big Data, its growing volumes, and the problems of the traditional methods, another approach was born. Here we upload data as it arrives, without any changes or transformations. In effect, this approach removes the upfront ETL process entirely and spares us the headache of understanding the source data's patterns and structure in advance.
This is schema-on-read. Ingestion is fast because the data does not have to conform to any internal schema; loading is just copying or moving files. This type of data handling is more flexible in the case of big data, unstructured data, or frequent schema changes.
So if we later analyze the data and find another way to interpret it, we simply change the code that handles the data. We do not need to change any schemas or reload all the data into storage.
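A minimal sketch of this idea, with invented file contents and field names: the raw lines land in storage untouched, and the "schema" lives entirely in the reading code. To reinterpret the data, we change the parser, not the storage.

```python
# Schema-on-read: the schema is applied by code at query time.
import io
import json

# Stand-in for a raw file in the data lake; note the second record
# is missing the "spend" field entirely.
raw = io.StringIO(
    '{"user": "alice", "ts": "2021-01-01", "spend": "19.99"}\n'
    '{"user": "bob", "ts": "2021-01-02"}\n'
)

def parse(line):
    # The schema is imposed here, at read time. Changing the
    # interpretation means changing this function, nothing else.
    record = json.loads(line)
    return {
        "user": record["user"],
        "spend": float(record.get("spend", 0.0)),
    }

records = [parse(line) for line in raw]
print(records[1]["spend"])  # 0.0
```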
But since the data does not go through strict ETL and transformation into rigid storage schemas, there can be missing or invalid values, duplicates, and many other problems that lead to inaccurate or incomplete query results.
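Without an enforcing schema, these quality checks shift to query time. A toy validation pass (the rows and rules are invented) that drops duplicates and unparseable values before aggregation:

```python
# Under schema-on-read, nothing rejected bad rows at ingestion,
# so every reader has to defend against them.
raw_rows = [
    {"id": 1, "amount": "10"},
    {"id": 1, "amount": "10"},    # duplicate: no primary key stopped it
    {"id": 2, "amount": "oops"},  # invalid: no type check stopped it
    {"id": 3, "amount": "5"},
]

seen, clean = set(), []
for row in raw_rows:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # skip values a strict schema would have rejected on write
    if row["id"] in seen:
        continue  # skip duplicates a unique constraint would have caught
    seen.add(row["id"])
    clean.append({"id": row["id"], "amount": amount})

print(len(clean))  # 2 valid, unique rows out of 4
```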
However, you should understand that some level of schema design is inevitable. One way or another, it is needed to understand the data's structure, to search the data, to validate incoming records, and to manage the data. But not all data falls under this process: some data is not yet clear even to the business itself, and you should not try to model and manage it formally before you can use it.
Which way is better?
As always, there is no silver bullet. Like most things in engineering, it depends on how you use it. Schema-on-write makes queries faster because the data is already stored in a strict format and the desired records are easy to find. However, it requires much more preliminary preparation and continuous transformation of incoming data, and the cost of changing the schema is high and best avoided.
Schema-on-read, on the other hand, provides flexibility and scalability and prevents some human mistakes. It is generally recommended to store the data in its original raw format (just in case) and keep an optimized copy in another format suitable for further processing. An ETL job may contain an invalid transformation, and the raw data lets you go back to the source and process it correctly. The flip side is that, since the schema must be applied when the data is queried, SQL queries tend to be very complex: they take time to write and even more time to run.
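The "keep the raw, derive the optimized" advice can be sketched in a few lines (data and function names invented): because the raw zone is never mutated, a buggy transformation can simply be fixed and re-run from scratch.

```python
# The raw zone is append-only; derived data is always reproducible from it.
raw = ["alice,19.99", "bob,5.01"]  # immutable raw records

def transform_v1(line):
    # Buggy transformation: silently truncates the cents.
    name, amount = line.split(",")
    return name, int(float(amount))

def transform_v2(line):
    # Fixed transformation: keeps the full amount.
    name, amount = line.split(",")
    return name, float(amount)

# After spotting the bug, reprocess everything from the untouched raw data.
derived = [transform_v2(line) for line in raw]
print(derived)  # [('alice', 19.99), ('bob', 5.01)]
```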
It is worth noting that some newer approaches, such as Apache Avro, can absorb schema changes through schema evolution rules (a mechanism of the serialization format, not version control) or through a schema registry. But the two fundamental types are not going anywhere; only the tooling around them changes.
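To make the Avro idea concrete, here is a sketch of two schema versions (the record and field names are invented). Avro's resolution rules let a reader using the newer schema decode records written with the older one, provided the added field declares a default; a schema registry can enforce that compatibility before a producer is allowed to publish the new version.

```python
# Avro-style schema evolution: v2 adds a field with a default value,
# so records written under v1 remain readable.
schema_v1 = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
}

schema_v2 = {
    "type": "record", "name": "User",
    "fields": schema_v1["fields"] + [
        # The default is what makes this change backward compatible:
        # old records simply read "country" as "unknown".
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

added = [f for f in schema_v2["fields"] if f not in schema_v1["fields"]]
print(added[0]["default"])  # unknown
```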