Implement SCD Type 2 Full Merge via Spark Data Frames

19,085 views · 3 comments
For SQL developers who are familiar with SCD and merge statements, you may wonder how to implement the same in big data platforms, considering that databases and storage in Hadoop are not designed/optimized for record-level updates and inserts. In this post, I'm going to demonstrate how to implement ...
Last modified by Administrator 2 years ago
Comments
#1477 Re: Implement SCD Type 2 Full Merge via Spark Data Frames
Subramanyam · 6m

Makes sense. Thanks a lot for the reply :-)

(Quoting Raymond's reply, #1476 below.)

#1476 Re: Implement SCD Type 2 Full Merge via Spark Data Frames
Raymond · 6m

Thanks for reading through, bro.

My comment in the last paragraph is based on the assumption that the last partition includes all the latest records. Thus, once the SCD Type 2 merge is done, there will be two new partitions in your Spark data frame:

  1. Expired records with the current business date as their end date
  2. The latest records with the high date (expiry date = 9999-12-31)

Assuming the data is partitioned by end date, you can write the results into HDFS by targeting just those two partition keys instead of overwriting the whole data set (see the sketch below).

*Note: this solution may lead to heavily skewed partitions unless roughly the same number of records is expired or inserted every day.

Hope this makes sense.
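
For illustration, here is a minimal PySpark sketch of that write pattern, assuming the table is partitioned by an end_date column; the column names, sample rows, and the output path are made up for the example, not taken from the article. With dynamic partition overwrite, Spark replaces only the partitions present in the written data frame and leaves every other end_date partition untouched.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scd2-partition-write")
    # Overwrite only the partitions present in the output,
    # not the whole target table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# merged_df stands in for the SCD Type 2 merge result: it contains only
# rows expired on the (example) current business date plus the open-ended
# rows carrying the high date.
merged_df = spark.createDataFrame(
    [
        (1, "old address", "2023-01-15"),   # expired today
        (1, "new address", "9999-12-31"),   # current record (high date)
        (2, "unchanged",   "9999-12-31"),   # current record, no change
    ],
    ["customer_id", "address", "end_date"],
)

(
    merged_df.write
    .mode("overwrite")
    .partitionBy("end_date")             # only two partitions are touched
    .parquet("/data/customer_scd2")      # assumed HDFS path
)
```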

(Quoting Subramanyam's question, #1475 below.)

#1475 Re: Implement SCD Type 2 Full Merge via Spark Data Frames
Subramanyam · 6m

Nice article, bro. Just one doubt: to optimize the full outer join, if we take only the active dataset, might it create a problem at the end, when we replace the original target with the final data frame? Please clarify. Thanks.
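
For context, a minimal PySpark sketch of the optimization being asked about follows; all names (customer_id, address, end_date) and the sample rows are assumptions for illustration, not from the article. Only the active slice of the target takes part in the full outer join, while the already-expired history is set aside.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-active-join").getOrCreate()

# Stand-in for the full target table (active + expired history).
target = spark.createDataFrame(
    [
        (1, "old address",  "2020-05-01"),   # expired long ago
        (1, "mid address",  "9999-12-31"),   # active
        (2, "some address", "9999-12-31"),   # active
    ],
    ["customer_id", "address", "end_date"],
)
# Stand-in for the incoming (source) data.
incoming = spark.createDataFrame(
    [(1, "new address")], ["customer_id", "address"]
)

# Split the target: only active rows join with the incoming data.
active  = target.filter(F.col("end_date") == "9999-12-31")
expired = target.filter(F.col("end_date") != "9999-12-31")

# Full outer join on the active slice only; the SCD Type 2
# expire/insert logic would then be applied to `joined`.
joined = active.join(
    incoming.withColumnRenamed("address", "new_address"),
    "customer_id",
    "full_outer",
)

# If the result must replace the whole target table, `expired` has to be
# unioned back in; a partitioned write (see the reply above) avoids that.
```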
