MERGE statement is commonly used in relational databases. It is also supported by BigQuery as one of the DML statements. This statement can be used to perform UPDATE, INSERT, DELETE in one single statement and perform the operations atomically.
About MERGE statement
For the details about the syntax of MERGE statement, please refer to the official documentation: Data manipulation language syntax | BigQuery | Google Cloud.
MERGE [INTO] target_name [[AS] alias]
USING source_name
ON merge_condition
{ when_clause }
As other databases, WHEN clause of BigQuery MERGE statement supports the following scenarios:
- When matched, UPDATE or DELETE clause can be used.
- When not matched by target, INSERT clause can be used.
- When not matched by source, UPDATE or DELETE clause can be used.
For each of the three scenarios, multiple WHEN clause can be issued to add different filters. For example, for matched records, only UPDATE or DELETE certain records that match with the filters.
MERGE example
Assuming there are two tables in data set test.
Target table: **dim\_customer**
CustomerID | CustomerNumber | Name | IsCurrent | StartDate | EndDate |
---|---|---|---|---|---|
e42a8b6e-e558-43fe-b5c2-378e681652bf | 001 | Raymond | 1 | 2021-01-01 | 9999-12-31 |
77878d7f-4e4f-4633-b7c9-796229e06cd3 | 002 | Jason | 1 | 2021-02-01 | 9999-12-31 |
The schema of this table is defined as below:
[
{
"name": "CustomerID",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "CustomerNumber",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "Name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "IsCurrent",
"type": "BOOLEAN",
"mode": "REQUIRED"
},
{
"name": "StartDate",
"type": "DATE",
"mode": "REQUIRED"
},
{
"name": "EndDate",
"type": "DATE",
"mode": "REQUIRED"
}
]
In target table, IsCurrentcolumn indicates whether the record is the latest record; StartDateand EndDate columns indicates the effective period of each record. CustomerNumberis the business key or natural key of each customer while CustomerID is the surrogate key (UUID data type).
The values were populated using the following INSERT statements:
insert into `test.dim_customer` values(GENERATE_UUID(),'001','Raymond',true,'2021-01-01','9999-12-31');
insert into `test.dim_customer` values(GENERATE_UUID(),'002','Jason',true,'2021-02-01','9999-12-31');
Source table: **staging\_customer**
CustomerNumber | Name |
---|---|
001 | Ray |
003 | Celia |
The schema of the table is defined below:
[
{
"name": "CustomerNumber",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "Name",
"type": "STRING"
}
]
The values in this table were populated using the following INSERT statements:
insert into `test.staging_customer` values('001','Ray');
insert into `test.staging_customer` values('003','Celia');
Code snippet
To merge staging_customertable into dim_customer, we can use the MERGE statement.
infoA FULL merge transformation is performed which assumes the source data is a FULL snapshot instead of DELTA or CDC records.
Let's first write a query to indicate the actions that need to be performed. Since MATCHED records only support UPDATE or DELETE action, we also need to ensure new records can be inserted with new values.
SELECT
COALESCE(staging.CustomerNumber,
dim.CustomerNumber) AS CustomerNumber,
COALESCE(dim.CustomerID,
GENERATE_UUID()) AS CustomerID,
dim.Name AS Name_Current,
staging.Name AS Name_New,
CASE
WHEN dim.CustomerID IS NULL THEN 'INSERT'
WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
WHEN staging.Name <> dim.Name THEN 'UPDATE'
END
AS MergeAction,
CASE
WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
ELSE
dim.StartDate
END
AS StartDate,
CASE
WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
ELSE
CURRENT_DATE()-1
END
AS EndDate
FROM
`test.staging_customer` staging
FULL JOIN (
SELECT
*
FROM
`test.dim_customer`
WHERE
EndDate=Date'9999-12-31') AS dim
ON
staging.CustomerNumber = dim.CustomerNumber
UNION ALL
SELECT
dim.CustomerNumber,
dim.CustomerID,
dim.Name AS Name_Current,
staging.Name AS Name_New,
'INSERT' AS MergeAction,
CURRENT_DATE() AS StartDate,
DATE'9999-12-31' AS EndDate
FROM
`test.staging_customer` staging
INNER JOIN
`test.dim_customer` dim
ON
staging.CustomerNumber = dim.CustomerNumber
AND dim.EndDate=Date'9999-12-31'
AND dim.Name <> staging.Name ;
The output looks like the following:
This query generates all the information we need to merge into the target table.
Now we can use the following MERGE statement to merge the data from staging into datamart.
MERGE INTO
`test.dim_customer` AS T
USING
(
SELECT
COALESCE(staging.CustomerNumber,
dim.CustomerNumber) AS CustomerNumber,
COALESCE(dim.CustomerID,
GENERATE_UUID()) AS CustomerID,
dim.Name AS Name_Current,
staging.Name AS Name_New,
CASE
WHEN dim.CustomerID IS NULL THEN 'INSERT'
WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
WHEN staging.Name <> dim.Name THEN 'UPDATE'
END
AS MergeAction,
CASE
WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
ELSE
dim.StartDate
END
AS StartDate,
CASE
WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
ELSE
CURRENT_DATE()-1
END
AS EndDate
FROM
`test.staging_customer` staging
FULL JOIN (
SELECT
*
FROM
`test.dim_customer`
WHERE
EndDate=Date'9999-12-31') AS dim
ON
staging.CustomerNumber = dim.CustomerNumber
UNION ALL
SELECT
dim.CustomerNumber,
dim.CustomerID,
dim.Name AS Name_Current,
staging.Name AS Name_New,
'INSERT' AS MergeAction,
CURRENT_DATE() AS StartDate,
DATE'9999-12-31' AS EndDate
FROM
`test.staging_customer` staging
INNER JOIN
`test.dim_customer` dim
ON
staging.CustomerNumber = dim.CustomerNumber
AND dim.EndDate=Date'9999-12-31'
AND dim.Name <> staging.Name ) AS S
ON
T.CustomerID = S.CustomerID
AND S.MergeAction IN ('UPDATE',
'DELETE')
WHEN MATCHED AND S.MergeAction = 'UPDATE' THEN UPDATE SET EndDate = S.EndDate, IsCurrent=FALSE
WHEN MATCHED
AND S.MergeAction = 'DELETE' THEN
UPDATE
SET
EndDate = S.EndDate
WHEN NOT MATCHED BY TARGET
AND S.MergeAction='INSERT' THEN
INSERT
(CustomerID,
CustomerNumber,
Name,
IsCurrent,
StartDate,
EndDate)
VALUES
(S.CustomerID, S.CustomerNumber, S.Name_New, TRUE, S.StartDate, S.EndDate) ;
After the statement is executed, the target table looks like the following screenshot:
infoNote: in many data warehousing implementations, people also adds a new attribute IsDeleted and insert a new record to indicate the period when the record is deleted from source. For simplicity, I'm not implementing that.
infoNot matched by source clause is not used in the above example, as the subquery already ensured all records can be joined including deleted one.
Summary
This article provides an example about implementation SCD 2 dimensions using MERGE statement in BigQuery. There are several other approaches to implement SCD 2 in BigQuery with varied performance. I will explore them in other articles.