Lakehouse

Yik San Chan, Wed Aug 25 2021 • data notes

This summary is based on the Lakehouse paper by Databricks. Many thanks to Ruben Berenguel's writing that helps me sort it out.

Data warehousing has evolved a lot, as shown in the following figure copied from the paper.

evolution

First Generation: Data Warehouses

Pros:

Management features such as ACID transactions, data versioning, auditing.
Performance features such as indexing, caching and query optimization.

Cons:

Cannot access internal format.
Cannot store unstructured data.
Cannot handle data science or machine learning workloads - data is large compared to BI, need to process using non-SQL code, reading data via JDBC is slow

Second Generation: two-tier lake + warehouse architecture

Data first go to the lake, and then get synced to warehouses.

Pros:

Lake provides low-cost and directly-accessible storage (previously HDFS, now S3) in open format (Parquet).
Support unstructured data.

Cons:

Lake loses management and performance features.
Difficult to keep lake and warehouse data in sync.
Data warehouse can run late in data.
Duplicated storage between lake and warehouse.

Third generation: Lakehouse architecture

Best of both worlds! Here's how:

Add a transactional metadata layer that supports management features.
Implement performance optimizations:
- Indexing: (not discussed yet).
- Caching: use SSDs and RAM as the cache.
- Query optimization: record ordering.

Conclusion

Note that Lakehouse is more of a specification, while Delta Lake is an implementation by Databricks. Go check it out!

Feedback is a gift! Please send your feedback via email or Twitter.

2021 © Yik San Chan. Built with Vercel and Nextra.