
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

Slowly Changing Dimensions have been commonly used in traditional data warehouse projects. They not only save storage space but also make certain query patterns easier, so they are still relevant in a data lake context. This article shows you how to implement a FULL merge into a Delta Lake SCD Type 2 ...
Last modified by Raymond, 4 months ago

Comments
Raymond, 27 days ago
#1739 Re: SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

It can definitely work in a streaming context. In fact, it works better from the perspective of the volume of data to process per run. For streaming cases, your CDC data usually includes inserts, updates, and deletes. The delta extract usually looks like this:

[Image: Data Engineering - Delta Extract]
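For readers who cannot see the image: a CDC extract typically carries the changed row plus an operation flag and a change timestamp. A hypothetical example (all column names and values are illustrative, not from the article):

```
id | op | change_ts           | name
---+----+---------------------+--------
 1 | I  | 2024-01-01 10:00:00 | Alice
 1 | U  | 2024-01-01 10:05:00 | Alicia
 2 | D  | 2024-01-01 10:07:00 | Bob
```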

You will need to make decisions about the following:

  • Is it important to capture every type of change for each entity during a micro-batch?
  • If not, you can keep only the last change record for each entity, whatever its type (insert/update/delete); see the dedup sketch below.
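
A minimal sketch of that dedup step inside a micro-batch, assuming hypothetical columns id (business key) and change_ts (change timestamp):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Keep only the latest change per key within the micro-batch.
# Column names (id, change_ts) are assumptions; adapt to your CDC schema.
w = Window.partitionBy("id").orderBy(F.col("change_ts").desc())

latest_changes = (
    cdc_batch_df                      # the micro-batch DataFrame
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
```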

Then you can use similar merge functions to merge your data. One thing to note is that in most upstream CDC systems, an update usually produces two records: a delete followed by an insert. You will need to design your merge behavior accordingly.
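
As a rough sketch (not the article's exact code) of what such a merge could look like in a streaming job: a common SCD Type 2 pattern stages each insert/update twice, once under its real key to close the current dimension row, and once under a NULL key so it falls through to the insert clause as the new version. The table name dim_customer and the columns op, change_ts, name, is_current, start_ts, and end_ts are all assumptions for illustration.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def upsert_scd2(batch_df, batch_id):
    # batch_df is assumed to already hold one (latest) change per key,
    # as in the dedup sketch above.
    target = DeltaTable.forName(batch_df.sparkSession, "dim_customer")

    # Inserts/updates appear twice in the staged source: once keyed (to close
    # the existing current row) and once with a NULL key (never matches, so it
    # becomes the new current version). Deletes appear once, keyed.
    key_type = batch_df.schema["id"].dataType
    upserts = batch_df.filter("op != 'D'")
    staged = (
        batch_df.withColumn("merge_key", F.col("id"))
        .unionByName(upserts.withColumn("merge_key", F.lit(None).cast(key_type)))
    )

    (target.alias("t")
        .merge(staged.alias("s"), "t.id = s.merge_key AND t.is_current = true")
        .whenMatchedUpdate(set={               # close the current version
            "is_current": "false",
            "end_ts": "s.change_ts",
        })
        .whenNotMatchedInsert(
            condition="s.merge_key IS NULL",   # only the new-version copies
            values={
                "id": "s.id",
                "name": "s.name",              # ...and other tracked attributes
                "is_current": "true",
                "start_ts": "s.change_ts",
                "end_ts": "null",
            },
        )
        .execute())

(cdc_stream.writeStream                        # the deduplicated CDC stream
    .foreachBatch(upsert_scd2)
    .start())
```

Because the dedup step leaves at most one change per key in each batch, the merge never sees two source rows trying to modify the same target row, a situation Delta Lake would reject.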

Quoted comment from Gabriel, 28 days ago:
Re: SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

Does this work in a streaming context? I found that when I apply multiple whenMatchedUpdate clauses, only the first one is picked up and applied.
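
For reference on this point: Delta Lake evaluates multiple whenMatched clauses in the order they are given and, for each matched row, applies only the first clause whose condition is true (all clauses except the last must specify a condition). So the clauses need conditions that distinguish them. A minimal sketch, with table and column names hypothetical:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "dim_customer")  # hypothetical table

(target.alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    # Evaluated first: only rows flagged as deletes take this branch.
    .whenMatchedUpdate(
        condition="s.op = 'D'",
        set={"is_current": "false"})
    # Reached only for matched rows the first condition did not catch.
    .whenMatchedUpdate(
        condition="s.op = 'U'",
        set={"name": "s.name"})
    .whenNotMatchedInsertAll()
    .execute())
```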
