SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

Raymond - 27 days ago
#1739 Re: SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

It definitely works in a streaming context. In fact, it works better there (from the perspective of the volume of data to process). In streaming cases, your CDC data usually includes inserts, updates and deletes, and the delta extracts usually look like this:

[Image: Data Engineering - Delta Extract]

You will need to make decisions about the following:

  • Is it important to capture every type of change for each entity during a micro batch?
  • If not, you can then use just the last changed record for each type (insert/update/delete).
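If you only need the latest change per key within each micro batch, a common way to reduce the batch is a `row_number` window over the business key, ordered by change timestamp descending. A minimal sketch, assuming a DataFrame `cdc_batch` with columns `id`, `operation` ('I'/'U'/'D') and `change_ts` (these names are illustrative, not from the article):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Latest change wins: partition by the business key,
# order by event time descending within the micro batch.
w = Window.partitionBy("id").orderBy(F.col("change_ts").desc())

latest_changes = (
    cdc_batch
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)  # keep only the most recent record per key
    .drop("rn")
)
```

In a Structured Streaming job this reduction would typically run inside a `foreachBatch` function before the merge.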

Then you can use similar merge functions to merge your data. One thing to note is that in most CDC systems on the source/upstream side, an update usually arrives as two records: a delete and an insert. You will need to design your merge behavior accordingly.
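On the point raised in the question: Delta Lake evaluates multiple `whenMatched` clauses in order and applies only the first one whose condition is satisfied, and every `whenMatched` clause except the last must have a condition. So instead of stacking unconditional `whenMatchedUpdate` calls, give the clauses mutually exclusive conditions on the CDC operation flag. A sketch under the same assumed schema (the table path, column names and the `latest_changes` DataFrame are illustrative):

```python
from delta.tables import DeltaTable

# Assumed target table path; `spark` is the active SparkSession.
target = DeltaTable.forPath(spark, "/mnt/delta/customers")

(
    target.alias("t")
    .merge(latest_changes.alias("s"), "t.id = s.id")
    # A CDC delete removes the matched target row.
    .whenMatchedDelete(condition="s.operation = 'D'")
    # Any other matched row is treated as an update.
    .whenMatchedUpdate(set={"name": "s.name", "change_ts": "s.change_ts"})
    # New keys become inserts, unless the only record is a delete.
    .whenNotMatchedInsert(
        condition="s.operation != 'D'",
        values={"id": "s.id", "name": "s.name", "change_ts": "s.change_ts"},
    )
    .execute()
)
```

Because the delete clause comes first with a condition, the unconditional update clause only sees the remaining matched rows, which avoids the "only the first clause is picked up" surprise.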


Gabriel - 28 days ago
Re: SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

Does this work in a streaming context? I found that when I apply multiple whenMatchedUpdate then only the first one is picked up and applied.
