MERGE Statement in BigQuery
MERGE statement is commonly used in relational databases. It is also supported by BigQuery as one of the DML statements. This statement can be used to perform UPDATE, INSERT, DELETE in one single statement and perform the operations atomically.
About MERGE statement
For the details about the syntax of MERGE statement, please refer to the official documentation: Data manipulation language syntax | BigQuery | Google Cloud.
MERGE [INTO] target_name [[AS] alias] USING source_name ON merge_condition { when_clause }
As other databases, WHEN clause of BigQuery MERGE statement supports the following scenarios:
- When matched, UPDATE or DELETE clause can be used.
- When not matched by target, INSERT clause can be used.
- When not matched by source, UPDATE or DELETE clause can be used.
For each of the three scenarios, multiple WHEN clause can be issued to add different filters. For example, for matched records, only UPDATE or DELETE certain records that match with the filters.
MERGE example
Assuming there are two tables in data set test.
Target table: dim_customer
CustomerID | CustomerNumber | Name | IsCurrent | StartDate | EndDate |
e42a8b6e-e558-43fe-b5c2-378e681652bf | 001 | Raymond | 1 | 2021-01-01 | 9999-12-31 |
77878d7f-4e4f-4633-b7c9-796229e06cd3 | 002 | Jason | 1 | 2021-02-01 | 9999-12-31 |
The schema of this table is defined as below:
[ { "name": "CustomerID", "type": "STRING", "mode": "REQUIRED" }, { "name": "CustomerNumber", "type": "STRING", "mode": "REQUIRED" }, { "name": "Name", "type": "STRING", "mode": "NULLABLE" }, { "name": "IsCurrent", "type": "BOOLEAN", "mode": "REQUIRED" }, { "name": "StartDate", "type": "DATE", "mode": "REQUIRED" }, { "name": "EndDate", "type": "DATE", "mode": "REQUIRED" } ]
In target table, IsCurrent column indicates whether the record is the latest record; StartDate and EndDate columns indicates the effective period of each record. CustomerNumber is the business key or natural key of each customer while CustomerID is the surrogate key (UUID data type).
The values were populated using the following INSERT statements:
insert into `test.dim_customer` values(GENERATE_UUID(),'001','Raymond',true,'2021-01-01','9999-12-31'); insert into `test.dim_customer` values(GENERATE_UUID(),'002','Jason',true,'2021-02-01','9999-12-31');
Source table: staging_customer
CustomerNumber | Name |
001 | Ray |
003 | Celia |
The schema of the table is defined below:
[ { "name": "CustomerNumber", "type": "STRING", "mode": "REQUIRED" }, { "name": "Name", "type": "STRING" } ]
The values in this table were populated using the following INSERT statements:
insert into `test.staging_customer` values('001','Ray'); insert into `test.staging_customer` values('003','Celia');
Code snippet
To merge staging_customer table into dim_customer, we can use the MERGE statement.Let's first write a query to indicate the actions that need to be performed. Since MATCHED records only support UPDATE or DELETE action, we also need to ensure new records can be inserted with new values.
SELECT COALESCE(staging.CustomerNumber, dim.CustomerNumber) AS CustomerNumber, COALESCE(dim.CustomerID, GENERATE_UUID()) AS CustomerID, dim.Name AS Name_Current, staging.Name AS Name_New, CASE WHEN dim.CustomerID IS NULL THEN 'INSERT' WHEN staging.CustomerNumber IS NULL THEN 'DELETE' WHEN staging.Name <> dim.Name THEN 'UPDATE' END AS MergeAction, CASE WHEN dim.CustomerID IS NULL THEN CURRENT_DATE() ELSE dim.StartDate END AS StartDate, CASE WHEN dim.CustomerID IS NULL THEN date'9999-12-31' ELSE CURRENT_DATE()-1 END AS EndDate FROM `test.staging_customer` staging FULL JOIN ( SELECT * FROM `test.dim_customer` WHERE EndDate=Date'9999-12-31') AS dim ON staging.CustomerNumber = dim.CustomerNumber UNION ALL SELECT dim.CustomerNumber, dim.CustomerID, dim.Name AS Name_Current, staging.Name AS Name_New, 'INSERT' AS MergeAction, CURRENT_DATE() AS StartDate, DATE'9999-12-31' AS EndDate FROM `test.staging_customer` staging INNER JOIN `test.dim_customer` dim ON staging.CustomerNumber = dim.CustomerNumber AND dim.EndDate=Date'9999-12-31' AND dim.Name <> staging.Name ;
The output looks like the following:
This query generates all the information we need to merge into the target table.
Now we can use the following MERGE statement to merge the data from staging into datamart.
MERGE INTO `test.dim_customer` AS T USING ( SELECT COALESCE(staging.CustomerNumber, dim.CustomerNumber) AS CustomerNumber, COALESCE(dim.CustomerID, GENERATE_UUID()) AS CustomerID, dim.Name AS Name_Current, staging.Name AS Name_New, CASE WHEN dim.CustomerID IS NULL THEN 'INSERT' WHEN staging.CustomerNumber IS NULL THEN 'DELETE' WHEN staging.Name <> dim.Name THEN 'UPDATE' END AS MergeAction, CASE WHEN dim.CustomerID IS NULL THEN CURRENT_DATE() ELSE dim.StartDate END AS StartDate, CASE WHEN dim.CustomerID IS NULL THEN date'9999-12-31' ELSE CURRENT_DATE()-1 END AS EndDate FROM `test.staging_customer` staging FULL JOIN ( SELECT * FROM `test.dim_customer` WHERE EndDate=Date'9999-12-31') AS dim ON staging.CustomerNumber = dim.CustomerNumber UNION ALL SELECT dim.CustomerNumber, dim.CustomerID, dim.Name AS Name_Current, staging.Name AS Name_New, 'INSERT' AS MergeAction, CURRENT_DATE() AS StartDate, DATE'9999-12-31' AS EndDate FROM `test.staging_customer` staging INNER JOIN `test.dim_customer` dim ON staging.CustomerNumber = dim.CustomerNumber AND dim.EndDate=Date'9999-12-31' AND dim.Name <> staging.Name ) AS S ON T.CustomerID = S.CustomerID AND S.MergeAction IN ('UPDATE', 'DELETE') WHEN MATCHED AND S.MergeAction = 'UPDATE' THEN UPDATE SET EndDate = S.EndDate, IsCurrent=FALSE WHEN MATCHED AND S.MergeAction = 'DELETE' THEN UPDATE SET EndDate = S.EndDate WHEN NOT MATCHED BY TARGET AND S.MergeAction='INSERT' THEN INSERT (CustomerID, CustomerNumber, Name, IsCurrent, StartDate, EndDate) VALUES (S.CustomerID, S.CustomerNumber, S.Name_New, TRUE, S.StartDate, S.EndDate) ;
After the statement is executed, the target table looks like the following screenshot:
Summary
This article provides an example about implementation SCD 2 dimensions using MERGE statement in BigQuery. There are several other approaches to implement SCD 2 in BigQuery with varied performance. I will explore them in other articles.
Hi Niko,
I'm glad you find it useful.
The example I provided is for a daily batch process thus I have not thought about intra-day process. When you mention about running the code on the same day with updated records, do you mean you need to run the merge on the same day with different records in staging table?
the run is daily, but if there is an error-robot and you have to re-run.
generally when I develop a job I try to make sure that it always does the same thing without generating errors
Thanks for the clarifications. It makes sense.
hi Raymond,
thank you so much for your article, much appreciated!!!!
I made a change from your initial code: if you run the code in the same day with UPDATE record your code generate a duplicate, so:
otherwise it was perfect :)
ps.: sorry for my english!
Niko