MERGE Statement in BigQuery

MERGE Statement in BigQuery

Raymond Tang Raymond Tang 0 11555 7.39 index 3/13/2021

MERGE statement is commonly used in relational databases. It is also supported by BigQuery as one of the DML statements. This statement can be used to perform UPDATE, INSERT, DELETE in one single statement and perform the operations atomically.

About MERGE statement

For the details about the syntax of MERGE statement, please refer to the official documentation: Data manipulation language syntax  |  BigQuery  |  Google Cloud.

MERGE [INTO] target_name [[AS] alias]
USING source_name
ON merge_condition
{ when_clause }

As other databases, WHEN clause of BigQuery MERGE statement supports the following scenarios:

  • When matched, UPDATE or DELETE clause can be used.
  • When not matched by target, INSERT clause can be used.
  • When not matched by source, UPDATE or DELETE clause can be used.

For each of the three scenarios, multiple WHEN clause can be issued to add different filters. For example, for matched records, only UPDATE or DELETE certain records that match with the filters.

20210312220558-cloud-bigquery-merge.png

MERGE example

Assuming there are two tables in data set test.

Target table: **dim\_customer**

CustomerID CustomerNumber Name IsCurrent StartDate EndDate
e42a8b6e-e558-43fe-b5c2-378e681652bf 001 Raymond 1 2021-01-01 9999-12-31
77878d7f-4e4f-4633-b7c9-796229e06cd3 002 Jason 1 2021-02-01 9999-12-31

The schema of this table is defined as below:

[
    {
        "name": "CustomerID",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "CustomerNumber",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "IsCurrent",
        "type": "BOOLEAN",
        "mode": "REQUIRED"
    },
    {
        "name": "StartDate",
        "type": "DATE",
        "mode": "REQUIRED"
    },
    {
        "name": "EndDate",
        "type": "DATE",
        "mode": "REQUIRED"
    }
]

In target table, IsCurrentcolumn indicates whether the record is the latest record; StartDateand EndDate columns indicates the effective period of each record. CustomerNumberis the business key or natural key of each customer while CustomerID is the surrogate key (UUID data type).

The values were populated using the following INSERT statements:

insert into `test.dim_customer` values(GENERATE_UUID(),'001','Raymond',true,'2021-01-01','9999-12-31');
insert into `test.dim_customer` values(GENERATE_UUID(),'002','Jason',true,'2021-02-01','9999-12-31');

Source table: **staging\_customer**

CustomerNumber Name
001 Ray
003 Celia

The schema of the table is defined below:

[
    {
        "name": "CustomerNumber",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING"
    }
]

The values in this table were populated using the following INSERT statements:

insert into `test.staging_customer` values('001','Ray');
insert into `test.staging_customer` values('003','Celia');

Code snippet

To merge staging_customertable into dim_customer, we can use the MERGE statement. 

infoA FULL merge transformation is performed which assumes the source data is a FULL snapshot instead of DELTA or CDC records. 

Let's first write a query to indicate the actions that need to be performed. Since MATCHED records only support UPDATE or DELETE action, we also need to ensure new records can be inserted with new values.

SELECT
  COALESCE(staging.CustomerNumber,
    dim.CustomerNumber) AS CustomerNumber,
  COALESCE(dim.CustomerID,
    GENERATE_UUID()) AS CustomerID,
  dim.Name AS Name_Current,
  staging.Name AS Name_New,
  CASE
    WHEN dim.CustomerID IS NULL THEN 'INSERT'
    WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
    WHEN staging.Name <> dim.Name THEN 'UPDATE'
END
  AS MergeAction,
  CASE
    WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
  ELSE
  dim.StartDate
END
  AS StartDate,
  CASE
    WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
  ELSE
  CURRENT_DATE()-1
END
  AS EndDate
FROM
  `test.staging_customer` staging
FULL JOIN (
  SELECT
    *
  FROM
    `test.dim_customer`
  WHERE
    EndDate=Date'9999-12-31') AS dim
ON
  staging.CustomerNumber = dim.CustomerNumber
UNION ALL
SELECT
  dim.CustomerNumber,
  dim.CustomerID,
  dim.Name AS Name_Current,
  staging.Name AS Name_New,
  'INSERT' AS MergeAction,
  CURRENT_DATE() AS StartDate,
  DATE'9999-12-31' AS EndDate
FROM
  `test.staging_customer` staging
INNER JOIN
  `test.dim_customer` dim
ON
  staging.CustomerNumber = dim.CustomerNumber
  AND dim.EndDate=Date'9999-12-31'
  AND dim.Name <> staging.Name ;

The output looks like the following:

20210312234648-image.png

This query generates all the information we need to merge into the target table.

Now we can use the following MERGE statement to merge the data from staging into datamart.

MERGE INTO
  `test.dim_customer` AS T
USING
  (
  SELECT
    COALESCE(staging.CustomerNumber,
      dim.CustomerNumber) AS CustomerNumber,
    COALESCE(dim.CustomerID,
      GENERATE_UUID()) AS CustomerID,
    dim.Name AS Name_Current,
    staging.Name AS Name_New,
    CASE
      WHEN dim.CustomerID IS NULL THEN 'INSERT'
      WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
      WHEN staging.Name <> dim.Name THEN 'UPDATE'
  END
    AS MergeAction,
    CASE
      WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
    ELSE
    dim.StartDate
  END
    AS StartDate,
    CASE
      WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
    ELSE
    CURRENT_DATE()-1
  END
    AS EndDate
  FROM
    `test.staging_customer` staging
  FULL JOIN (
    SELECT
      *
    FROM
      `test.dim_customer`
    WHERE
      EndDate=Date'9999-12-31') AS dim
  ON
    staging.CustomerNumber = dim.CustomerNumber
  UNION ALL
  SELECT
    dim.CustomerNumber,
    dim.CustomerID,
    dim.Name AS Name_Current,
    staging.Name AS Name_New,
    'INSERT' AS MergeAction,
    CURRENT_DATE() AS StartDate,
    DATE'9999-12-31' AS EndDate
  FROM
    `test.staging_customer` staging
  INNER JOIN
    `test.dim_customer` dim
  ON
    staging.CustomerNumber = dim.CustomerNumber
    AND dim.EndDate=Date'9999-12-31'
    AND dim.Name <> staging.Name ) AS S
ON
  T.CustomerID = S.CustomerID
  AND S.MergeAction IN ('UPDATE',
    'DELETE')
  WHEN MATCHED AND S.MergeAction = 'UPDATE' THEN UPDATE SET EndDate = S.EndDate, IsCurrent=FALSE
  WHEN MATCHED
  AND S.MergeAction = 'DELETE' THEN
UPDATE
SET
  EndDate = S.EndDate
  WHEN NOT MATCHED BY TARGET
  AND S.MergeAction='INSERT' THEN
INSERT
  (CustomerID,
    CustomerNumber,
    Name,
    IsCurrent,
    StartDate,
    EndDate)
VALUES
  (S.CustomerID, S.CustomerNumber, S.Name_New, TRUE, S.StartDate, S.EndDate) ;

After the statement is executed, the target table looks like the following screenshot:

2021031300937-image.png

infoNote: in many data warehousing implementations, people also adds a new attribute IsDeleted and insert a new record to indicate the period when the record is deleted from source. For simplicity, I'm not implementing that. 

infoNot matched by source clause is not used in the above example, as the subquery already ensured all records can be joined including deleted one. 

Summary

This article provides an example about implementation SCD 2 dimensions using MERGE statement in BigQuery. There are several other approaches to implement SCD 2 in BigQuery with varied performance. I will explore them in other articles.

bigquery gcp

Join the Discussion

View or add your thoughts below

Comments