For a long time, I didn't understand the concepts of Data Lake and Data Warehouse. I thought it was the same thing — a data storage where I could find the data and process it for my purposes.
I wasn't wrong but there is a difference.
Data Warehouse supports data flow from different operational systems to analysis/solution systems by creating a single repository of data from different sources using various ETL processes.
Data sources can be very diverse and have different data representations, which can lead to divergent information (accounting, billing, banking systems). In addition, the large variety of data models makes it difficult to obtain consolidated reports when a complete picture from all application systems is needed. In general, this is the main reason why Data Warehouse solutions appeared in the first place.
A Data Warehouse can be represented as a relational database where processed business data is stored, but this will not be entirely true, things are a little bit more complicated. Data Warehouse has a complex multi-level architecture called LSA — Layered Scalable Architecture. In fact, LSA implements a logical division of structures with data into several functional levels. The data are copied from level to level and transformed in order to eventually be available as consistent information suitable for analysis.
Primary data Layer or Staging
Here the information is loaded from the source systems in its original quality and the full history of changes is saved. This layer also abstracts the following storage layers from the physical data representation of the data sources, how they are collected, and how modifications are extracted. On this layer, ETL pipelines are usually used to transfer data from source systems to the data warehouse.
Core Data Layer
It is a kind of operational component that performs consolidation, normalization, deduplication, and cleansing of data from different sources, resulting in common structures and keys. This is where the main work with data quality and general transformations take place in order to abstract the consumers from the peculiarities of the logical arrangement of data sources and the need for their comparison. This is the way to ensure data integrity and quality. Transformations and direct inserting of new data are created from the data model.
The data model is a specification of all entities, objects in the corporate data warehouse database. The model defines the entities and relationships between them, the business area, and the entire database structure — from tables and fields within them to partitions and indexes.
Data Mart Layer
On this layer, processed, cleansed, and consolidated data is converted to structures that are easy to analyze and use in BI-dashboards or other consumer systems. Data marts provide different domain-specific views of data and can take information from any previous layers.
This layer controls all the layers described above. It does not contain business data, but it does operate metadata and other data quality structures, allowing for end-to-end data auditing, MDM, Data governance, security, and load management. Monitoring and error diagnostics tools are also available here, which speeds up problem solving.
To sum up, such systems can store reliable facts as well as statistics. Decision-maker in your company can obtain this information at any time when it is needed to meet personal and business needs. In addition to making strategic decisions, it can be useful when it comes to financial management, strategic decisions, and sales. But storing data in the Data Warehouse is costlier and time-consuming. The Data Warehouse is ideal for operational users because of being well structured and easy to use.
Although data warehouses can handle unstructured data, they cannot do so efficiently. When you have a large amount of data, storing all your data in a database or data warehouse can be expensive. In addition, the data that comes into the data warehouses must be processed before it can be stored in some shape or structure. In other words, it should have a data model.
In response, businesses began to support Data Lakes, which stores all structured and unstructured enterprise data on a large scale in the most cost-effective way. Data Lakes stores raw data and can operate without having to determine the structure and layout of the data beforehand. In the case of the Data Lake, the information is structured at the output when you need to extract data and analyze it. At the same time, the process of analysis does not affect the data themselves in the lake — they remain unstructured so that they can be conveniently stored and used for other purposes. This way we get the flexibility that Data Warehouse hasn't.
Thus, the Data Lake differs significantly from the Data Warehouse. However, LSA's architectural approach can also be used in the construction of Data Lake(my representation).
- Raw level stores raw data in various formats (tsv, csv, parquet, json, etc)
- On the operational level(core layer) raw data can be transformed into any required form. While in Raw, data is stored in its native format, here we choose the format that fits best for cleansing. The structure is the same as in the previous layer but it may be partitioned to lower grain if needed.
- Data Mart. Data is transformed into consumable data sets and it may be stored in files or tables. The purpose of the data, as well as its structure at this stage, is already known. Cleansing and transformations should be done before this layer. This layer is optional.
Often Data Lakes is used to store information that is not yet used by analysts but is likely to be useful for the company in the future. However, if data lakes are poorly managed, they quickly accumulate huge amounts of uncontrolled data, most often useless. It is no longer clear where they came from and when, how relevant they are, whether they can be used for analysis or not. This is how data swamps appear — useless and devouring company resources. To prevent the lake from becoming a swamp, the company needs to establish a data management process — data governance. The main part of this process is to determine the correctness and quality of the data even before loading it into the data lake. Therefore, when designing any data lake, first of all, it is necessary to decide on what purposes to build it.
Check out this quick overview video with Adam Kocoloski as he goes through a data lake architecture and explains how the data lakes help to create the ML systems for businesses:
- Different purposes. Data Warehouses are used by managers, analysts, and other business end-users, while Data Lakes are mainly used by Data Scientist and Data engineers. Recall that Data Lake stores mostly raw unstructured and semi-structured data — telemetry, graphics, logs of user behavior, site metrics, and information systems, as well as other data with different storage formats. They are not yet suitable for daily analytics in BI systems but can be used by Data scientists to test new business hypotheses using statistical algorithms and machine learning methods.
- Different processing methods. ETL is a popular data processing paradigm in many popular data warehousing. Essentially we extract data from a source or sources, clean it up, and convert it into the structured information we need and upload it. With Data Lakes we use other paradigm ELT(Extract, Load, Transform) because the transformation takes place in the later stages and only if needed not upfront.
- Different levels of understanding of the data. In Data Lakes data is never rejected because it is stored in an unprocessed format. This is especially useful in an environment with large data if you do not know in advance what information will be obtained from the data analysis. At the same time, the central database(s) is the foundation of the data warehousing environment. Usually, such databases are implemented on RDBMS technology and therefore the necessary in-depth design of the data model is required.
- Different approaches to design. Data Warehouse design is based on relational data handling logic — the third normal form for normalized storage, star or snowflake schemes for storages. When designing the data lake, the Big Data Architect and Data Engineer pay more attention to ETL processes, taking into account the diversity of sources and consumers of information. And the question of storage is solved quite simply — you only need a scalable, fault-tolerant, and relatively cheap file system, such as HDFS or AWS S3.
- Different price. Usually, Data Lake is built on the basis of cheap servers with Apache Hadoop, without expensive licenses and powerful equipment, in contrast to a lot of maintenance costs as well as large costs of design and purchase of specialized platforms for Data Warehouse, such as SAP, Oracle, Teradata, etc.
Buy me a coffee