For a long time, I didn't understand the concepts of Data Lake and Data Warehouse. I thought it was the same thing — a data storage where I could find the data and process it for my purposes.
I wasn't wrong but there is a difference.
Data Warehouse has a single repository of data collected from different sources using various ETL processes.
Data sources can be very diverse and have different data representations, which can lead to divergent information (accounting, billing, banking systems). In addition, the large variety of schemas and structures in data sources makes it difficult to obtain consolidated information when a complete snapshot of the data is required from all business sub-systems. In general, this is the main reason for the emergence of Data Warehouse solutions.
You can think of Data Warehouse as a relational database where processed business data is stored, but this will not be entirely true — things are a little bit more complicated. Data Warehouse has a complex multi-level architecture called LSA — Layered Scalable Architecture. LSA implements a logical division of structures with data into several functional levels. The data are copied from level to level and transformed in order to eventually be available as consistent information suitable for analysis.
Primary data Layer or Staging
Here the information is loaded from the source systems in its original quality and the full history of changes is saved. This layer also abstracts the following storage layers from the physical data representation of the data sources, how they are collected and transformed. On this layer, ETL pipelines are usually used to transfer data from source systems to the Data Warehouse.
Core Data Layer
This layer is a somewhat operational component that performs aggregation, normalization, deduplication, and cleansing of data resulting in common structures and columns.
It's the place where the main work with data quality and transformations takes place in order to abstract the consumers from the peculiarities of the logical arrangement of data sources and the need for their comparison. And also to ensure data integrity and quality. All the transformations and updates of the system state are derived from Data Model.
The Data Model is a specification of all entities, objects in the corporate Data Warehouse storage. The model defines the entities and relationships between them, the business area, and the entire database structure — from tables and fields within them to partitions and indexes.
There are different approaches to architecture in this layer, the most famous ones are probably Kimball and Inmon. This is important and has a direct impact on the overall Data Warehouse characteristics. Based on the chosen approach we may or may not need a Data Mart Layer, but let's skip those details.
Data Mart Layer
On this layer, processed, cleansed, and aggregated data is converted to structures that are easy to analyze and use in BI-dashboards or other consumer systems. Very often there is also denormalization of data happening at this level.
Data Marts provide different domain-specific views of data and can take information from any previous layer.
This layer controls all the layers described above. It does not contain any data, but it does operate metadata and other data quality structures, allowing for end-to-end data auditing, MDM, Data governance, security, and load management. Monitoring and error diagnostics tools are also available here, which speeds up problem-solving.
To sum up, such systems can store reliable facts as well as analytical results. Decision-makers in your company can obtain this information at any time when it is needed to meet personal and business needs. In addition to making strategic decisions, it can be useful when it comes to financial management, strategic decisions, and sales.
Despite its advantages, storing and managing data in Data Warehouse is costly and time-consuming. The Data Warehouse is ideal for operational users because of being well structured and easy to use.
Although Data Warehouses can handle unstructured data, they cannot do so efficiently. When you have a large amount of data, storing it in a database or Data Warehouse can be expensive. In addition, the data that comes into the Data Warehouses must be processed before it can be stored in some schema or structure. In other words, it should have a Data Model which is not always possible.
In response, businesses began to support Data Lakes, which stores all structured and unstructured enterprise data on a large scale in one place. And they try to do that in the most cost-effective way.
Data Lakes stores raw data and can operate without having to determine the schema or structure beforehand. In the case of the Data Lake, the end-user should structure the information himself. What is cool is that any retrieval process or analysis does not affect the data stored in Data Lake — it remains in the same exact format as it came. So it can be conveniently stored and used for other purposes — we get the flexibility that Data Warehouse hasn't.
Thus, the Data Lake differs significantly from the Data Warehouse. However, LSA's architectural approach can also be used in the construction of Data Lake(my representation).
- Raw level stores raw data in various formats in their original form (tsv, csv, parquet, json, etc)
- On the operational level(core layer) raw data is transformed into any required form. Here we choose the format that fits best for further processing. The structure is the same as in the previous layer but it may be partitioned to lower grain if needed.
- Data Mart. Data is transformed into consumable data sets and it may be stored in files or tables. The purpose of the data, as well as its structure at this stage, is already known. Cleansing and transformations should be done before this layer.
You can argue with me here about how the Data Mart component relates to Data Lake. And it's true — the data can be taken here not only from Data Lake but also from Data Warehouse. This layer can also be missing, which also makes sense when you have a separate Data Warehouse that takes data from Data Lake. My argument here is that data and Data Mart views must be manageable and they should be coupled with Data Lake and its processes.
Often Data Lakes is used to store information that is not yet used by analysts but is likely to be useful for the company in the future. However, if Data Lakes are poorly managed, they quickly accumulate huge amounts of uncontrolled data, most often useless. It is no longer clear where they came from and when, how relevant they are, whether they can be used for analysis or not. This is how Data Swamps appear — useless and devouring company resources. To prevent the lake from becoming a swamp, the company needs to establish a data management process — Data Governance. The main part of this process is to determine the correctness and quality of the data even before loading it into the data lake. Therefore, when designing any data lake, first of all, it is necessary to decide its purposes.
Check out this quick overview video with Adam Kocoloski as he goes through a data lake architecture and explains how the data lakes help to create the ML systems for businesses:
- Different purposes. Data Warehouses are used by managers, analysts, and other business end-users, while Data Lakes are mainly used by Data Scientists and Data engineers. Recall that Data Lake stores mostly raw unstructured and semi-structured data — telemetry, graphics, logs of user behavior, website metrics, and information systems, as well as other data with different storage formats. They are not yet suitable for daily analytics in BI systems but can be used by Data Scientists to test new business hypotheses using statistical algorithms and Machine Learning methods.
- Different processing methods. ETL is a popular data processing paradigm in many popular data warehousing. Essentially we extract data from a source or sources, clean it up, and convert it into the structured information we need, and upload it. With Data Lakes we use another paradigm ELT(Extract, Load, Transform) because the transformation takes place in the later stages and only if needed not upfront.
- Different levels of understanding of the data. In Data Lakes data is never rejected because it is stored in an unprocessed format. This is especially useful in an environment with large data if you do not know in advance what information will be obtained from the data analysis. At the same time, the central database(s) is the foundation of the data warehousing environment. Usually, such databases are implemented on RDBMS technology and therefore the necessary in-depth design of the data model is required.
- Different approaches to design. Data Warehouse design is based on relational data handling logic — the third normal form for normalized storage, star or snowflake schemes for storage. When designing the data lake, the Big Data Architect and Data Engineer pay more attention to ETL processes, taking into account the diversity of sources and consumers of information. And the question of storage is solved quite simply — you only need a scalable, fault-tolerant, and relatively cheap file system, such as HDFS or AWS S3.
- Different price. Usually, Data Lake is built on the basis of cheap servers with Apache Hadoop, without expensive licenses and powerful equipment, in contrast to a lot of maintenance costs as well as large costs of design and purchase of specialized platforms for Data Warehouse, such as SAP, Oracle, Teradata, etc.
Now new systems are beginning to emerge that address the limitations of both Data Lake and Data Warehouse — Lakehouse(new marketing term from Databricks). The main solutions are Delta Lake from Databricks, Apache Hudi from Uber, Apache Iceberg from Netflix.