History of data warehouses and data lakes

Continuing on the topic of data storage evolution, let's shift our focus to data warehouses, data lakes, and lakehouses, and their pros and cons.

History of data warehouses and data lakes

Going back to the 1980s, we can trace the high-level evolution of data storage architecture.

  • In the 1980s data warehouses took hold and were being built for the next 20 years
  • Then in the 2000s, with the emergence of HDFS, data lakes appeared and became widely popular
  • In the mid-2010s cloud data platforms appeared that combined data lakes and data warehouses, adding real-time data processing
  • And in the 2020s the technology matured to the point where data lakehouses could be created: an extension of the data lake with the optimized, open-source Delta Lake framework (https://delta.io/). This brought data versioning on ingest, ACID transactions, schema enforcement, combined batch and streaming ingestion, deletes/updates/merges, and performance improvements on top of Parquet.
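To make these features concrete, here is a minimal PySpark sketch of the Delta Lake API, assuming a Spark session with the delta-spark package installed; the table path, column names, and data are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Delta-enabled Spark session (requires the delta-spark package).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Initial load: Delta stores Parquet data files plus a transaction log,
# and enforces this schema on all subsequent writes.
events = spark.createDataFrame([(1, "signup"), (2, "login")], ["id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Upsert (delete/update/merge support): runs as one ACID transaction.
updates = spark.createDataFrame([(2, "logout"), (3, "signup")], ["id", "event"])
(
    DeltaTable.forPath(spark, "/tmp/delta/events").alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was before the merge (versioning).
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```

The merge executes as a single transaction against the table's log, and the versionAsOf read illustrates the data versioning mentioned above.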

Data warehouse – motivations, pros and cons

A data warehouse is a centralized database where you store data from multiple sources for historical and trend-analysis reporting.
It contains a single version of the truth for the organization, serving as a central repository for many subject areas sourced from multiple systems.

Motivations for having a data warehouse

  • It reduces stress on OLTP databases, since the warehouse is optimized for reads and sequential access.
  • A data warehouse also creates an abstraction over the data, protecting clients from disruptions when source systems are upgraded or replaced.
  • A data warehouse provides a very good opportunity to restructure and systematize data modeling.
  • A data warehouse allows better security with role-based access control.
  • A data warehouse allows interactive ad-hoc querying of the data.
  • A data warehouse can integrate an MDM (master data management) tool and benefit from it – especially if a lot of warehouse time was being spent on data cleansing and data fixes.

Note that an alternative to centralized MDM could be a strong culture of ownership and solid boundaries between teams – creating a distributed system where each domain is strongly owned and maintained. Please see Data Mesh later on in the blog.
The downside to this distributed approach is that it requires a good integration architecture.

Data mart vs warehouse?

Touching on the point of large data warehouses – there is also the concept of a data mart.
A data mart is a curated set of datasets for a smaller audience, frequently built off of a data warehouse.
Another motivation for a data mart could be solving a performance problem of the data warehouse by exporting curated datasets out to smaller data marts.

Data lake

A data lake is a data repository with schema-on-read, meaning that the data consumer needs to discover the dataset's schema before being able to read the data. This is the opposite of schema-on-write, where the schema is enforced during data ingress.

This provides flexibility and speed for data producers but puts the onus on the data consumer to figure out the data. This also means the consumer will burn the producer's cycles later anyway, when trying to make sense of the data. The sketch below contrasts the two approaches.
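Here is a minimal PySpark sketch, with hypothetical lake paths and columns: schema-on-read infers structure only when the consumer queries the raw files, while schema-on-write (shown here with a Delta table) declares the schema up front and rejects non-conforming writes.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Delta-enabled session, configured as in the earlier sketch.
spark = (
    SparkSession.builder.appName("schema-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Schema-on-read (lake style): raw JSON landed as-is; the structure is
# inferred only when a consumer finally reads the files.
raw = spark.read.json("/lake/raw/orders/")
raw.printSchema()  # the consumer discovers the schema at read time

# Schema-on-write (warehouse style): the schema is declared up front and
# enforced at ingress; a Delta append with a mismatched schema fails
# instead of landing silently.
orders_schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer", StringType(), nullable=False),
])
clean = spark.createDataFrame([(1, "acme")], schema=orders_schema)
clean.write.format("delta").mode("overwrite").save("/lake/clean/orders/")

bad = spark.createDataFrame([("oops",)], ["unexpected_column"])
# The next line would raise an AnalysisException: schema mismatch.
# bad.write.format("delta").mode("append").save("/lake/clean/orders/")
```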

  • Data lakes are attractive because they are much less expensive than a data warehouse – per GB of data stored.
  • As a consequence, a data lake allows data hoarding, just in case the data is needed at some point in the future.
  • The data is stored in its native format in the data lake, which means there will be a variety of serialization approaches.
  • A data lake will relieve the data warehouse of a sizable chunk of queries, letting users achieve the same result more cheaply.
  • There are tools like Databricks, Redshift, etc. that can do data exploration, refinement, and analysis quicker than a data warehouse. Data scientists are also typically happier with fast access to the data lake than with waiting for data to be onboarded into the data warehouse. This saves time and money by qualifying only valuable data for ETL and data warehouse storage, and it allows the data warehouse to maintain ingestion velocity by keeping less useful data in the lake.
  • A data lake can be a great place for the ETL landing/staging phase, allowing re-runs and going back to the source as needed. Typically the data will be separated into dirty and clean zones, with ETL consuming the clean data. Additionally, a data lake can accommodate a sandbox, where a copy of the data (potentially masked) is stored for a data analyst to run POCs (see the sketch after this list).
  • Another benefit of a data lake is that it allows Hadoop tools to be used, with high availability and DR (disaster recovery) built in.
  • A data lake is a cost-efficient place to store IoT streaming data, although Kafka may be an alternative in certain short-retention scenarios.
  • A data lake often serves as a place for data warehouse archives and older data backups.
  • A data lake provides good performance for bulk rewrites of large distributed files (fine-grained updates are a weak spot that lakehouse formats address).
  • A data lake allows separation of the storage and compute tiers, letting each scale independently and more cost-efficiently.
  • A data lake enables location-agnostic access to the data.
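As a sketch of the landing/clean/sandbox zoning mentioned above, assuming PySpark (all paths, columns, and the masking rule are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("zones-demo").getOrCreate()

# Landing ("dirty") zone: raw extracts kept verbatim, so the ETL can be
# re-run from these copies without going back to the source system.
landing = spark.read.json("/lake/landing/crm/2024-01-15/")

# Basic cleansing before promotion to the clean zone.
cleaned = (
    landing
    .filter(F.col("customer_id").isNotNull())      # drop rows missing the key
    .withColumn("email", F.lower(F.col("email")))  # normalize a column
)

# Clean zone: the curated copy that downstream ETL consumes.
cleaned.write.mode("overwrite").parquet("/lake/clean/crm/customers/")

# Sandbox zone: a masked copy for analyst POCs.
masked = cleaned.withColumn("email", F.lit("***masked***"))
masked.write.mode("overwrite").parquet("/lake/sandbox/crm/customers/")
```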

Lakehouse

The lakehouse tries to address the common data warehouse complaints: that the data is stale, costly, and less reliable (i.e., inconsistent with the data lake data elsewhere in the organization).

But the lakehouse has its cons too:

  • Performance is worse than that of the data warehouse's MPP database.
  • Lack of role-based access control, no column-based security, and no dynamic data masking.
  • Complexity due to being file-based, with metadata kept separate from the data.
  • No foreign keys or column value constraints.
  • No interactive ad-hoc querying and no SQL, which creates a barrier to entry when trying to use the data.
  • Performance investigation tools are not there yet, whereas they are excellent in the data warehouse world (e.g., execution plans).

Data fabric

A data fabric can be thought of as an extension of the data warehouse, adding MDM, real-time processing, APIs, a metadata catalog, data lineage, etc.

Data mesh

Data mesh is a decentralized data architecture: instead of a single central repository, each domain owns and serves its own data.

  • It solves the data ownership problem, as domains keep the data. In a data mesh, data becomes a product (vs. being a by-product of operations).
  • Data mesh solves a lot of data quality problems as the data stays with the team that knows it best.
  • Data mesh allows you to scale technically (add more hardware) and organizationally, since you can add domains fairly quickly. It also means that there's no longer a centralized data warehouse team acting as a bottleneck.

The data mesh cons are:

  • It hides the big picture
  • It requires an organization-level culture change, where each domain would need to own, cleanse, and govern its data
  • It requires more engineers and skilled resources, and thus turns into a very large investment at the enterprise level
  • Efforts and resources are duplicated across the domains, and it's tough to find enough skilled people for each domain (this could be counter-balanced by sharing engineering resources, for example)
  • It requires global governance and open standards, plus a data infrastructure engineering team and a platform for storage, pipelines, cataloguing, access control, etc.
  • It is difficult to get domains to prioritize additional external work when they are already busy with ongoing internal efforts
  • Consumers would need to be trained on how to source the data from separate domains and how to join it – this could be a big hurdle
  • Sourcing data from disparate domains will negatively affect performance
  • Data will be duplicated, even if only to solve performance problems
  • Synchronizing regulatory or project-level efforts across domains will be hard, as each domain moves at its own pace

Data virtualization

Data virtualization abstracts access to the data sources directly, bypassing ETL and local storage.
Sometimes data virtualization tools allow writing new data back to the sources. Data virtualization is typically used in service-oriented architectures.

With data virtualization, data can be accessed in near real time and masked if necessary.
Data virtualization can be used to integrate different sources of truth into a single view within the enterprise.
Access policies and privacy may be costly to implement in a data virtualization environment.
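Dedicated virtualization products implement this pattern natively, but a minimal PySpark sketch can illustrate the idea; the JDBC URL, credentials, tables, and columns below are hypothetical, and a PostgreSQL JDBC driver is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtualization-demo").getOrCreate()

# A live OLTP source queried in place over JDBC: rows are fetched at
# query time rather than copied into the lake or warehouse by an ETL.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://oltp-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "report_user")
    .option("password", "***")
    .load()
)

# A second source of truth, already sitting in the lake.
customers = spark.read.parquet("/lake/clean/customers/")

# One integrated view over both sources; masking can be applied here so
# consumers never see the sensitive column.
virtual = orders.join(customers, "customer_id").drop("card_number")
virtual.createOrReplaceTempView("orders_360")

spark.sql("SELECT customer_id, COUNT(*) AS n "
          "FROM orders_360 GROUP BY customer_id").show()
```

The trade-off is that every query hits the sources live, which is part of why access policies and performance need careful attention in these environments.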