data-engineering
18 items tagged with "data-engineering"
Articles
Diagrams
Apache Doris Realtime MPP Data Analytics Use Scenarios
This diagram shows the common usage scenarios of Apache Doris - the open source real-time data warehouse.
ACID Support for Data Lake with Delta Lake, Hudi, Iceberg, Hive and Impala
This diagram summarizes the commonly used frameworks for building a data lake that supports ACID (Atomicity, Consistency, Isolation, Durability) transactions. Apache Hive/Impala with ORC-based transactional tables: storage format is ORC. Hive ACID Inserts, Updates and Deletes with ORC. Delta Lake: storage format is Parquet with transactional JSON log files. Delta Lake with PySpark Walkthrough. Apache Hudi: storage format is Parquet. Apache Iceberg: stored as Parquet, ORC or Avro. These frameworks have different implementation mechanisms but all support schema evolution and integrate with the Hive metastore (catalog) and computing frameworks like Apache Spark, Trino, etc.
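The transactional-log idea these formats share can be illustrated with a toy sketch (this is a conceptual simplification, not the real Delta Lake protocol; the entry schema and file names are assumptions): each commit appends one JSON entry listing data files added or removed, and readers replay the log to build a consistent snapshot, so half-written data files are never visible.

```python
import json

# Ordered commit entries, standing in for files like 000000.json, 000001.json, ...
log = []

def commit(add=(), remove=()):
    """Atomically record a change as one appended log entry."""
    log.append(json.dumps({"add": list(add), "remove": list(remove)}))

def snapshot():
    """Replay the log to compute the set of live data files."""
    live = set()
    for entry in log:
        action = json.loads(entry)
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

commit(add=["part-0.parquet"])
commit(add=["part-1.parquet"])
# An "update" rewrites a file and swaps it in a single commit:
commit(add=["part-2.parquet"], remove=["part-0.parquet"])
print(snapshot())
```

Because data files are immutable and only the log append is the commit point, readers either see a commit entirely or not at all.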
Slowly Changing Dimension (SCD) Type 4
This diagram shows how a slowly changing dimension type 4 table is implemented. customernumber is the business key of the customer table while customerid is a surrogate key. Customer 10001 is changing first_name from Kontext to Context. SCD Type 4 uses a history table to track the historical changes. This method is similar to change data capture or audit table implementations.
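A minimal Python sketch of the Type 4 pattern, assuming a dict-based current table and a list-based history table (the changed_at column and scd4_update helper are illustrative, not from the diagram):

```python
from datetime import datetime

# Current table is overwritten like Type 1; a separate history table
# records each superseded version, audit-table style.
customer = {10001: {"customernumber": 10001, "first_name": "Kontext"}}
customer_history = []  # one row per superseded version

def scd4_update(current, history, customernumber, changes):
    """Copy the old row into history, then overwrite in place."""
    old = dict(current[customernumber])
    old["changed_at"] = datetime(2024, 6, 1)  # illustrative timestamp
    history.append(old)
    current[customernumber].update(changes)

scd4_update(customer, customer_history, 10001, {"first_name": "Context"})
print(customer[10001]["first_name"])        # Context
print(customer_history[0]["first_name"])    # Kontext
```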
Slowly Changing Dimension (SCD) Type 3
This diagram shows how a slowly changing dimension type 3 table is implemented. customernumber is the business key of the customer table while customerid is a surrogate key. Customer 10001 is changing first_name from Kontext to Context. SCD Type 3 adds a new attribute to keep the previous value alongside the current one. The drawback is that it can keep only the previous and current values. For SCD Type 3, a surrogate ID is not necessary.
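A minimal sketch of the Type 3 pattern, assuming a previous_first_name column (the column name and helper are illustrative):

```python
# SCD Type 3: one extra column holds the prior value, so only the
# previous and current values survive; no surrogate key is needed.
customer = {
    10001: {"customernumber": 10001,
            "first_name": "Kontext", "previous_first_name": None},
}

def scd3_update(table, customernumber, new_first_name):
    row = table[customernumber]
    row["previous_first_name"] = row["first_name"]  # any older history is lost
    row["first_name"] = new_first_name

scd3_update(customer, 10001, "Context")
print(customer[10001])  # first_name: Context, previous_first_name: Kontext
```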
Slowly Changing Dimension (SCD) Type 2
This diagram shows how a slowly changing dimension type 2 table is implemented. customernumber is the business key of the customer table while customerid is a surrogate key. Customer 10001 is changing first_name from Kontext to Context. SCD Type 2 tracks the history of changes. It is usually implemented with an effectivedate column to indicate the date when the change becomes effective and an iscurrent flag to indicate the currently active record. It can also be implemented using effectivefromdate and effectivetodate columns: when effectivetodate is NULL or equals a high date like 9999-12-31, the record is the latest version. For performance and other considerations, it can also be implemented with a combination of effectivefromdate, effectivetodate and is_current.
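A minimal sketch of the effectivefromdate/effectivetodate/is_current variant, using in-memory rows (the scd2_update helper and dates are illustrative, not a reference implementation):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended "to" date marking the current version

# Each row carries the surrogate key (customerid), business key
# (customernumber), the tracked attribute, and a validity window.
rows = [
    {"customerid": 1, "customernumber": 10001, "first_name": "Kontext",
     "effectivefromdate": date(2020, 1, 1), "effectivetodate": HIGH_DATE,
     "is_current": True},
]

def scd2_update(table, customernumber, changes, effective_date):
    """Close the current version and insert a new one with a fresh surrogate key."""
    current = next(r for r in table
                   if r["customernumber"] == customernumber and r["is_current"])
    current["effectivetodate"] = effective_date
    current["is_current"] = False
    new_row = {**current, **changes,
               "customerid": max(r["customerid"] for r in table) + 1,
               "effectivefromdate": effective_date,
               "effectivetodate": HIGH_DATE,
               "is_current": True}
    table.append(new_row)

scd2_update(rows, 10001, {"first_name": "Context"}, date(2024, 6, 1))
# The table now holds both the closed Kontext row and the current Context row.
```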
Slowly Changing Dimension (SCD) Type 1
This diagram shows how a slowly changing dimension type 1 table is implemented. SCD Type 1 will simply overwrite old data with new data without keeping a history. customernumber is the business key of the customer table while customerid is a surrogate key. Customer 10001 is changing first_name from Kontext to Context. For SCD Type 1, surrogate ID is not necessary.
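The Type 1 update is the simplest of the four and can be sketched as a plain in-place overwrite (the dict-based table and helper are illustrative):

```python
# SCD Type 1: overwrite old data with new data; previous values are lost.
# The table is keyed by the business key (customernumber).
customer = {
    10001: {"customernumber": 10001, "first_name": "Kontext"},
}

def scd1_update(table, customernumber, changes):
    """Overwrite the changed attributes; no history is kept."""
    table[customernumber].update(changes)

scd1_update(customer, 10001, {"first_name": "Context"})
print(customer[10001]["first_name"])  # Context
```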
AWS ETL Solution with Glue Diagram
This diagram shows one example of using AWS Glue to crawl, catalog and process data stored in S3. Data landed in the raw bucket is scanned by a Glue Crawler and the metadata is stored in the Glue Catalog. A Glue ETL job loads the raw data, performs transformations and stores the processed data in the curated bucket. The processed files are scanned by another Glue Crawler. The processed data is then queried with Amazon Athena and can be further utilized in reporting and dashboards.
AWS Big Data Lambda Architecture for Streaming Analytics
This diagram shows a typical lambda streaming processing solution on AWS with Amazon Kinesis, AWS Glue, Amazon S3, Amazon Athena and Amazon QuickSight: Amazon Kinesis - captures streaming data via Data Firehose and then transforms and analyzes it using Data Analytics; the analytics results are delivered through another Data Firehose stream; for batch processing, the captured streaming data can also be loaded directly into an S3 bucket. Amazon S3 - stores streaming raw data and batch processed data. AWS Glue - transforms batch data in S3 and stores the processed data in another bucket for consumption. Amazon Athena - used to query data in S3 via SQL. Amazon QuickSight - data visualization tool. References: AWS IoT Streaming Processing Solution Diagram; AWS IoT Streaming Processing Solution Diagram w Glue.
AWS IoT Streaming Processing Solution Diagram w Glue
This diagram shows a typical streaming processing solution on AWS with Amazon Kinesis, AWS Glue, Amazon S3, Amazon Athena and Amazon QuickSight: Amazon Kinesis - captures streaming data via Data Firehose and loads it into S3. Amazon S3 - stores streaming raw data and batch processed data. AWS Glue - transforms batch data in S3 and stores the processed data in another bucket for consumption. Amazon Athena - used to query data in S3 via SQL. Amazon QuickSight - data visualization tool. Similar solution diagram using streaming transformation: AWS IoT Streaming Processing Solution Diagram.
AWS IoT Streaming Processing Solution Diagram
This diagram shows a typical streaming processing solution on AWS with Amazon Kinesis, Amazon S3, Amazon Athena and Amazon QuickSight: Amazon Kinesis - captures streaming data via Data Firehose and then transforms and analyzes it using Data Analytics; the analytics results are delivered through another Data Firehose stream. Amazon S3 - the processed streaming data is stored in Amazon S3. Amazon Athena - used to query data in S3 via SQL. Amazon QuickSight - data visualization tool.
AWS Batch Processing Solution Diagram (using AWS Glue)
This diagram shows a typical batch processing solution on AWS with Amazon S3, AWS Lambda, AWS Glue and Amazon Redshift: Amazon S3 is used to store staging data extracted from source systems on-premises or on-cloud. AWS Lambda is used to register data arrival in S3 buckets with the ETL framework and trigger the batch process. AWS Glue is then used to integrate the data (merging, sorting, filtering, aggregation and other transformations) and load it. Amazon Redshift is then used to store the transformed data. This diagram is forked from AWS Batch Processing Solution Diagram.
AWS Batch Processing Solution Diagram
This diagram shows a typical batch processing solution on AWS with Amazon S3, AWS Lambda, Amazon EMR and Amazon Redshift: Amazon S3 is used to store staging data extracted from source systems on-premises or on-cloud. AWS Lambda is used to register data arrival in S3 buckets with the ETL framework and trigger the batch process. Amazon EMR is then used to transform the data (for example, aggregations) and load it. Amazon Redshift is then used to store the transformed data. This follows the traditional ETL pattern; you can also change it to an ELT pattern and do the transformations directly in Redshift. Amazon EMR can be replaced with many other products.
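The Lambda "data arrival" step can be sketched as a plain handler that parses the standard S3 event notification payload; register_arrival stands in for whatever ETL-framework registration or EMR-trigger call you would actually make (a hypothetical name, and the sample event is trimmed to the fields used):

```python
from urllib.parse import unquote_plus

def handler(event, context):
    """Lambda entry point: extract (bucket, key) pairs from an S3 event."""
    arrivals = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        arrivals.append((bucket, key))
        # register_arrival(bucket, key)  # hypothetical ETL-framework hook
    return arrivals

# A minimal event in the shape S3 delivers (non-essential fields omitted):
sample_event = {"Records": [
    {"s3": {"bucket": {"name": "staging-bucket"},
            "object": {"key": "sales/2024-06-01/part-000.csv"}}}
]}
print(handler(sample_event, None))
```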
Data Distribution Approaches in Parallel Computing System
This diagram shows the typical algorithms used to distribute data across a cluster for parallel processing or computing. They are commonly used in systems like Teradata, SQL Server PDW, Azure Synapse, Spark, etc. Replicated - the table is replicated to every node. This is useful for distributing small tables, like reference tables, that are joined with big tables. Round-robin distributed - data is distributed to the nodes in the cluster using a round-robin algorithm. This is useful for big tables without obvious candidate join keys, as it ensures data is evenly distributed across the cluster. Hash distributed - data is distributed using a deterministic hashing algorithm on the key values; rows with the same key value are guaranteed to land on the same node. This is the most commonly used distribution approach for big tables. What other distribution algorithms have you used?
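The round-robin and hash approaches can be sketched in a few lines of Python (the 4-node cluster and row layout are illustrative; real engines use a stable hash function so that placement survives restarts, unlike Python's per-process-seeded hash()):

```python
from collections import defaultdict

NODES = 4  # illustrative cluster size

def round_robin_distribute(rows):
    """Assign rows to nodes in turn: even spread, no co-location guarantee."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % NODES].append(row)
    return placement

def hash_distribute(rows, key):
    """Deterministic hash on the key: equal keys always land on the same node."""
    placement = defaultdict(list)
    for row in rows:
        placement[hash(row[key]) % NODES].append(row)
    return placement

rows = [{"id": n, "region": r} for n, r in enumerate("AABBCCDD")]
by_rr = round_robin_distribute(rows)    # 2 rows per node, regions scattered
by_hash = hash_distribute(rows, "region")  # both "A" rows on one node, etc.
```

The co-location property of hash distribution is what lets joins and aggregations on the distribution key run locally on each node without shuffling.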
Data Engineering - Transactional Extract
When extracting data from source databases, there are usually three approaches: Full extract - extracts all the records from the source table. This usually applies to small to medium sized tables, for example, product tables. Transactional extract - extracts all the new records since the last extract. This is a good fit for transactional sources like banking transactions, logs, etc. Delta extract - extracts all the changed records, including updated, new and deleted ones. This reduces the amount of data that needs to be transferred from the source system to the target system and usually involves CDC (change data capture). This diagram shows the high-level flow of transactional extract. Usually your ETL (Extract, Transform, Load) framework records the last extract point, for example, a date, timestamp or auto-increment column value.
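A minimal sketch of the watermark logic behind a transactional extract, assuming a timestamp column and an in-memory watermark store (real ETL frameworks persist the extract point durably, and the table and column names are illustrative):

```python
from datetime import datetime

# Source rows with a monotonically recorded creation timestamp.
source_transactions = [
    {"txn_id": 1, "created_at": datetime(2024, 6, 1, 9, 0)},
    {"txn_id": 2, "created_at": datetime(2024, 6, 1, 12, 0)},
    {"txn_id": 3, "created_at": datetime(2024, 6, 2, 8, 0)},
]
# Last extract point per table; an auto-increment id would work the same way.
watermarks = {"transactions": datetime(2024, 6, 1, 10, 0)}

def transactional_extract(rows, table):
    """Return only rows newer than the stored watermark, then advance it."""
    last = watermarks[table]
    new_rows = [r for r in rows if r["created_at"] > last]
    if new_rows:
        watermarks[table] = max(r["created_at"] for r in new_rows)
    return new_rows

batch = transactional_extract(source_transactions, "transactions")
# Picks up txn 2 and 3; the next run starts from 2024-06-02 08:00.
```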
Data Engineering - Full Extract
Of the three common extraction approaches (full, transactional and delta, described in the transactional extract diagram above), this diagram shows the high-level flow of full extract: all records are extracted from the source table. This usually applies to small to medium sized tables, for example, product tables.
Data Engineering - Delta Extract
This diagram shows the high-level flow of delta extract, the third of the three common extraction approaches: it extracts all the changed records, including updated, new and deleted ones, usually via CDC (change data capture), to reduce the amount of data transferred from the source system to the target system. It is very important to include the file of deleted records unless your data source never deletes records. If your source team cannot provide the deleted records, you will need to request a full extract so that you can work out the changes by comparing it with your target tables.
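When the source cannot supply deletes and you fall back to comparing a full extract with the target, the diff can be sketched as follows (dict-keyed tables and column names are illustrative; at warehouse scale this comparison would be done with joins or MERGE statements rather than in memory):

```python
# Tables keyed by the business key; values are the full rows.
target = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
full_extract = {2: {"id": 2, "name": "b2"}, 3: {"id": 3, "name": "c"}}

def diff_extract(source, target):
    """Derive inserts, updates and deletes by comparing key sets and rows."""
    inserts = [source[k] for k in source.keys() - target.keys()]
    deletes = [target[k] for k in target.keys() - source.keys()]
    updates = [source[k] for k in source.keys() & target.keys()
               if source[k] != target[k]]
    return inserts, updates, deletes

ins, upd, dele = diff_extract(full_extract, target)
# id 3 is new, id 2 changed, id 1 was deleted at the source.
```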