delta-lake
6 items tagged with "delta-lake"
Articles
Time Travel with Delta Table in PySpark
Delta Lake provides time travel functionality to retrieve data at a certain point in time or at a certain version. This can be done easily using the following two options when reading from a Delta table as a DataFrame:
versionAsOf - an integer value to specify a version.
timestampAsOf - a timestamp or date string.
This code snippet shows you how to use them in Spark DataFrameReader APIs. It includes three examples:
Query data as of version 1
Query data as of 30 days ago (using a value computed via Spark SQL)
Query data as of a certain timestamp ('2022-09-01 12:00:00.999999UTC 10:00')
You may encounter issues if the timestamp is earlier than the earliest commit:
pyspark.sql.utils.AnalysisException: The provided timestamp (2022-08-03 00:00:00.0) is before the earliest version available to this table (2022-08-27 10:53:18.213). Please use a timestamp after 2022-08-27 10:53:18.
Similarly, if the provided timestamp is later than the last commit, you may encounter another issue like the following:
pyspark.sql.utils.AnalysisException: The provided timestamp: 2022-09-07 00:00:00.0 is after the latest commit timestamp of 2022-08-27 11:30:47.185. If you wish to query this version of the table, please either provide the version with "VERSION AS OF 1" or use the exact timestamp of the last commit: "TIMESTAMP AS OF '2022-08-27 11:30:47'".
References Delta Lake with PySpark Walkthrough
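The two reader options and the range checks behind those AnalysisException messages can be sketched in plain Python. The DeltaTimeTravel class below is a hypothetical helper, not a Delta Lake API: it only builds the option dictionaries and mimics the validation that the requested timestamp must fall between the table's earliest and latest commits.

```python
from datetime import datetime

# Hypothetical helper (not part of Delta Lake): builds time-travel reader
# options and mimics Delta's timestamp range checks described above.
class DeltaTimeTravel:
    def __init__(self, earliest_commit: datetime, latest_commit: datetime):
        self.earliest_commit = earliest_commit
        self.latest_commit = latest_commit

    def options_for_version(self, version: int) -> dict:
        # Equivalent to .option("versionAsOf", version) on a DataFrameReader.
        return {"versionAsOf": version}

    def options_for_timestamp(self, ts: datetime) -> dict:
        # Delta rejects timestamps outside [earliest commit, latest commit],
        # raising AnalysisException with messages like the ones quoted above.
        if ts < self.earliest_commit:
            raise ValueError(
                f"The provided timestamp ({ts}) is before the earliest "
                f"version available to this table ({self.earliest_commit})."
            )
        if ts > self.latest_commit:
            raise ValueError(
                f"The provided timestamp: {ts} is after the latest commit "
                f"timestamp of {self.latest_commit}."
            )
        # Equivalent to .option("timestampAsOf", str(ts)).
        return {"timestampAsOf": ts.strftime("%Y-%m-%d %H:%M:%S")}


tt = DeltaTimeTravel(
    earliest_commit=datetime(2022, 8, 27, 10, 53, 18),
    latest_commit=datetime(2022, 8, 27, 11, 30, 47),
)
print(tt.options_for_version(1))  # {'versionAsOf': 1}
print(tt.options_for_timestamp(datetime(2022, 8, 27, 11, 0, 0)))
```

In actual PySpark, the options are passed directly to the reader, e.g. spark.read.format("delta").option("versionAsOf", 1).load(path).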
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark
Streaming from Kafka to Delta Lake Table via PySpark
Delta Lake with PySpark Walkthrough
Diagrams
ACID Support for Data Lake with Delta Lake, Hudi, Iceberg, Hive and Impala
This diagram summarizes the commonly used frameworks to build a data lake that supports ACID (Atomicity, Consistency, Isolation, Durability):
Apache Hive/Impala with ORC-based transactional tables: storage format is ORC. See Hive ACID Inserts, Updates and Deletes with ORC.
Delta Lake: storage format is parquet with transactional JSON log files. See Delta Lake with PySpark Walkthrough.
Apache Hudi: storage format is parquet.
Apache Iceberg: stored as parquet, ORC or Avro.
These frameworks have different implementation mechanisms but can all support schema evolution and integrate with the Hive meta catalog (metastore) and computing frameworks like Apache Spark, Trino, etc.
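The "parquet with transactional JSON log files" point about Delta Lake can be illustrated with a toy on-disk layout. The sketch below only fabricates the directory structure: the file names follow Delta's _delta_log convention, but the commit contents are simplified placeholders, not the full Delta protocol actions.

```python
import json
import os
import tempfile

# Sketch of a Delta table's layout: parquet data files at the table root,
# plus ordered JSON commit files under _delta_log/ that provide the ACID
# guarantees. Commit bodies here are simplified stand-ins.
root = tempfile.mkdtemp(prefix="delta_demo_")
os.makedirs(os.path.join(root, "_delta_log"))

# Two "data files" and the commits that added them.
for version, data_file in enumerate(["part-0000.snappy.parquet",
                                     "part-0001.snappy.parquet"]):
    open(os.path.join(root, data_file), "wb").close()  # empty stand-in file
    commit = {"add": {"path": data_file, "dataChange": True}}
    log_name = f"{version:020d}.json"  # commits are zero-padded and ordered
    with open(os.path.join(root, "_delta_log", log_name), "w") as f:
        json.dump(commit, f)

# Readers reconstruct the table state by replaying the log in order.
commits = sorted(os.listdir(os.path.join(root, "_delta_log")))
print(commits)  # ['00000000000000000000.json', '00000000000000000001.json']
```

Time travel falls out of this design: reading "as of version 1" simply means replaying only the commits up to 00000000000000000001.json.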
Delta Lake Architecture
This diagram shows the architecture of Delta Lake. Delta Lake is an open-source storage framework that can be used to build a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive. It publishes APIs for Scala, Java, Rust, Ruby, and Python.