data-lake
3 items tagged with "data-lake"
Diagrams
ACID Support for Data Lake with Delta Lake, Hudi, Iceberg, Hive and Impala
This diagram summarizes the frameworks commonly used to build a data lake that supports ACID (Atomicity, Consistency, Isolation, Durability) transactions. Apache Hive/Impala with ORC-based transactional tables: the storage format is ORC (see "Hive ACID Inserts, Updates and Deletes with ORC"). Delta Lake: the storage format is Parquet, with transactional JSON log files (see "Delta Lake with PySpark Walkthrough"). Apache Hudi: the storage format is Parquet. Apache Iceberg: data is stored as Parquet, ORC or Avro. These frameworks use different implementation mechanisms, but all of them support schema evolution and integrate with the Hive metastore and with compute frameworks such as Apache Spark and Trino.
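To make the "Parquet with transactional JSON log files" idea more concrete, here is a minimal sketch in plain Python of how an append-only JSON commit log can track which data files belong to a table. It is a simplified illustration, not Delta Lake's actual implementation: the helper names (commit, active_files, demo) and the file names are hypothetical, no Parquet files are written, and the real protocol records far more metadata. The sketch assumes that each commit is a zero-padded, versioned JSON file containing add and remove actions, and that a version number can only be claimed once.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit as a zero-padded JSON-lines file.

    Mode "x" makes creation fail if the file already exists, so two
    writers cannot both claim the same version number.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x", encoding="utf-8") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def active_files(log_dir):
    """Replay every commit in version order and return the live data files."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name), encoding="utf-8") as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


def demo():
    """Run a tiny insert / update / conflicting-write scenario."""
    with tempfile.TemporaryDirectory() as log_dir:
        # Version 0: an insert adds the first (hypothetical) data file.
        commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
        # Version 1: an update rewrites the data, so the old file is
        # removed and a new one is added within the same commit.
        commit(log_dir, 1, [
            {"remove": {"path": "part-0000.parquet"}},
            {"add": {"path": "part-0001.parquet"}},
        ])
        # A second writer trying to reuse version 1 is rejected.
        try:
            commit(log_dir, 1, [{"add": {"path": "part-conflict.parquet"}}])
            conflict_rejected = False
        except FileExistsError:
            conflict_rejected = True
        return active_files(log_dir), conflict_rejected
```

Calling demo() replays both commits and reports whether the conflicting write was refused. Readers only see files that a completed commit has added, which is the basic reason a log of this kind can give atomic, isolated updates on top of immutable data files.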
Delta Lake Architecture
This diagram shows the architecture of Delta Lake. Delta Lake is an open-source storage framework that can be used to build a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive. It provides APIs for Scala, Java, Rust, Ruby, and Python.