This diagram summarizes the commonly used frameworks for building a data lake that supports ACID (Atomicity, Consistency, Isolation, Durability) transactions.
- Apache Hive/Impala with ORC-based transactional tables: the storage format is ORC. See Hive ACID Inserts, Updates and Deletes with ORC.
- Delta Lake: the storage format is Parquet with transactional JSON log files. See Delta Lake with PySpark Walkthrough, and the sketch after this list.
- Apache Hudi: the storage format is Parquet.
- Apache Iceberg: the storage format is Parquet, ORC, or Avro.
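To make the Delta Lake entry above concrete, here is a minimal PySpark sketch. The table path `/tmp/delta/users` and the sample rows are arbitrary placeholders, not from the original walkthrough. It writes a Delta table and performs an ACID update; on disk this produces Parquet data files plus JSON commit files under `_delta_log`.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session with the Delta Lake extensions enabled.
spark = (
    SparkSession.builder
    .appName("delta-acid-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/users"  # placeholder path for this sketch

# Initial write: Parquet data files plus the first JSON commit in _delta_log.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# ACID update: rewrites only the affected Parquet files and appends a new JSON commit.
DeltaTable.forPath(spark, path).update(
    condition="id = 2",
    set={"name": "'bobby'"},
)

spark.read.format("delta").load(path).show()
```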
They have different implementation mechanisms, but all of them support schema evolution and integrate with the Hive metastore and compute engines such as Apache Spark and Trino.
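As one example of the schema evolution mentioned above, the following sketch (assuming the same placeholder table path as the previous sketch) appends a DataFrame that carries an extra column using Delta Lake's `mergeSchema` option; Hudi and Iceberg expose comparable schema evolution behavior through their own Spark integrations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-schema-evolution-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/users"  # same placeholder table as the previous sketch

# Append rows with a new "country" column; mergeSchema records the widened schema
# in the transaction log instead of failing the write.
spark.createDataFrame([(3, "carol", "us")], ["id", "name", "country"]) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path)

spark.read.format("delta").load(path).printSchema()
```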