When we talk about working with data, we usually doing it in a system that belongs to one of two types.
The first of them is a schema-on-write.
Probably many of you already have worked with relational databases. And you understand that the first step to working with a relational database is to create tables and setting up schemas. When we set up schemas, created tables then we can start ingesting the data. Most likely it will be something like the bulk upload data from a text file whose structure we know in advance — since it somehow matches the table schema. And, after data loaded into the table, we can start analytical queries on our data. This is the schema-on-read — an approach where we define the columns, data format, column relationships, and so on before directly loading the data.
But there is a vexing problem — we cannot load data until the table is created and we cannot create tables until we understand the data schema that will be in this table. This also leads to problems with data changes. For example, the text file has changed or someone added data or changed the type of column. What we need to do in this situation is to drop the table and load the data again. This is ok when we talk about a small amount of data, it is ok if there are no foreign keys. But if there are foreign keys and we, for example, have 500 GB of data? This will be a real problem and with the same approach, it will take days for such simple changes.
So what does this approach give us?
As soon as we complete the ETL process of loading our data, the subsequent reading of the data will be fast and determinant (everyone will receive the same strictly defined data that corresponds to the original schema). But you should keep in mind, that we already paid a penalty for this when we've loaded the data. Also, the dark side of the strictly defined schema is that the data has been modified and structured for a specific limited purpose and may not be reusable for future use cases.
With the rise of the big data with the onset of problems with traditional methods and increased data volumes, a different approach was born. Where we upload data as it comes to us without any changes and transformations. In fact, in this approach, we completely remove our initial ETL process and calm our nerves from understanding the original data schemas and its structure.
This is schema-on-read, it is distinguished by fast data ingestion since the data does not have to follow any internal schema — it is just copying/moving files. Such type of work with data is more flexible in the case of Big Data, unstructured data or frequent schema changes.
Therefore, if we analyze the data and try to understand it's structure and find another way of interpreting them, we can simply change the code which is processing data. We do not need to change the schemes and reload the entire data warehouse.
But, since the data is not going through strict ETL and data cleansing processes, and also does not pass any verification, this data can have a lot of missing or invalid data, duplicates and many other problems that can lead to inaccurate or incomplete query results.
However, you need to understand that some level of schema design is inevitable. One way or another, it is necessary both for understanding the data structure in order to search for insights in them, for validation of incoming data and data management. But not all data falls under this process — some data is not understandable to the business itself and you should not try to model it and manage it in a formal scheme before it can be used.
Which way is better?
Now, which way is better? Schema-on-Read or Schema-on-Write?
There is no silver bullet as always. Like most things in engineering, it depends on the use case. Schema-on-write helps in faster performance of the query, as the data is already loaded in a particular format and it is easy to find needed data. However, it requires a lot more up-front preparation and ongoing transformation of the incoming data and the costs of making changes to the schema are high and should be avoided.
Schema-on-read, on the other hand, can give flexibility, scalability, and prevention of some of the human errors. It is usually recommended that you store the data in its original raw format (just in case) and optimize it in a different format, which is convenient for the further data processing mechanism. An ETL may have incorrect conversions, and the original raw data will allow you to return to the source and process the data correctly. In addition, since the structure must be defined when querying data, SQL queries tend to be very complex. They take time to write, and even more, time to complete.
It is worth noting that some newer approaches, such as using Apache Avro can abstract schema changes through version control(format mechanism not VCS), but the general point of view remains the same — there are important differences between these two different approaches that we need to be aware of.
Daily dose of