MERGE Statement in BigQuery

Raymond Raymond event 2021-03-13 visibility 11,405 comment 4
more_vert
MERGE Statement in BigQuery

MERGE statement is commonly used in relational databases. It is also supported by BigQuery as one of the DML statements. This statement can be used to perform UPDATE, INSERT, DELETE in one single statement and perform the operations atomically. 

About MERGE statement

For the details about the syntax of MERGE statement, please refer to the official documentation: Data manipulation language syntax  |  BigQuery  |  Google Cloud.

MERGE [INTO] target_name [[AS] alias]
USING source_name
ON merge_condition
{ when_clause }

As other databases, WHEN clause of BigQuery MERGE statement supports the following scenarios:

  • When matched, UPDATE or DELETE clause can be used.
  • When not matched by target, INSERT clause can be used.
  • When not matched by source, UPDATE or DELETE clause can be used.

For each of the three scenarios, multiple WHEN clause can be issued to add different filters. For example, for matched records, only UPDATE or DELETE certain records that match with the filters. 


20210312220558-cloud-bigquery-merge.png

MERGE example

Assuming there are two tables in data set test.

Target table: dim_customer

CustomerIDCustomerNumberNameIsCurrentStartDateEndDate
e42a8b6e-e558-43fe-b5c2-378e681652bf
001Raymond12021-01-019999-12-31
77878d7f-4e4f-4633-b7c9-796229e06cd3
002Jason12021-02-019999-12-31

The schema of this table is defined as below:

[
    {
        "name": "CustomerID",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "CustomerNumber",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "IsCurrent",
        "type": "BOOLEAN",
        "mode": "REQUIRED"
    },
    {
        "name": "StartDate",
        "type": "DATE",
        "mode": "REQUIRED"
    },
    {
        "name": "EndDate",
        "type": "DATE",
        "mode": "REQUIRED"
    }
]

In target table, IsCurrent column indicates whether the record is the latest record; StartDate and EndDate columns indicates the effective period of each record. CustomerNumber is the business key or natural key of each customer while CustomerID is the surrogate key (UUID data type). 

The values were populated using the following INSERT statements:

insert into `test.dim_customer` values(GENERATE_UUID(),'001','Raymond',true,'2021-01-01','9999-12-31');
insert into `test.dim_customer` values(GENERATE_UUID(),'002','Jason',true,'2021-02-01','9999-12-31');

Source table: staging_customer

CustomerNumberName
001Ray
003

Celia

The schema of the table is defined below:

[
    {
        "name": "CustomerNumber",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "Name",
        "type": "STRING"
    }
]

The values in this table were populated using the following INSERT statements:

insert into `test.staging_customer` values('001','Ray');
insert into `test.staging_customer` values('003','Celia');

Code snippet

To merge staging_customer table into dim_customer, we can use the MERGE statement. 

infoA FULL merge transformation is performed which assumes the source data is a FULL snapshot instead of DELTA or CDC records. 

Let's first write a query to indicate the actions that need to be performed. Since MATCHED records only support UPDATE or DELETE action, we also need to ensure new records can be inserted with new values. 

SELECT
  COALESCE(staging.CustomerNumber,
    dim.CustomerNumber) AS CustomerNumber,
  COALESCE(dim.CustomerID,
    GENERATE_UUID()) AS CustomerID,
  dim.Name AS Name_Current,
  staging.Name AS Name_New,
  CASE
    WHEN dim.CustomerID IS NULL THEN 'INSERT'
    WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
    WHEN staging.Name <> dim.Name THEN 'UPDATE'
END
  AS MergeAction,
  CASE
    WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
  ELSE
  dim.StartDate
END
  AS StartDate,
  CASE
    WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
  ELSE
  CURRENT_DATE()-1
END
  AS EndDate
FROM
  `test.staging_customer` staging
FULL JOIN (
  SELECT
    *
  FROM
    `test.dim_customer`
  WHERE
    EndDate=Date'9999-12-31') AS dim
ON
  staging.CustomerNumber = dim.CustomerNumber
UNION ALL
SELECT
  dim.CustomerNumber,
  dim.CustomerID,
  dim.Name AS Name_Current,
  staging.Name AS Name_New,
  'INSERT' AS MergeAction,
  CURRENT_DATE() AS StartDate,
  DATE'9999-12-31' AS EndDate
FROM
  `test.staging_customer` staging
INNER JOIN
  `test.dim_customer` dim
ON
  staging.CustomerNumber = dim.CustomerNumber
  AND dim.EndDate=Date'9999-12-31'
  AND dim.Name <> staging.Name ;

The output looks like the following:

20210312234648-image.png

This query generates all the information we need to merge into the target table.

Now we can use the following MERGE statement to merge the data from staging into datamart. 

MERGE INTO
  `test.dim_customer` AS T
USING
  (
  SELECT
    COALESCE(staging.CustomerNumber,
      dim.CustomerNumber) AS CustomerNumber,
    COALESCE(dim.CustomerID,
      GENERATE_UUID()) AS CustomerID,
    dim.Name AS Name_Current,
    staging.Name AS Name_New,
    CASE
      WHEN dim.CustomerID IS NULL THEN 'INSERT'
      WHEN staging.CustomerNumber IS NULL THEN 'DELETE'
      WHEN staging.Name <> dim.Name THEN 'UPDATE'
  END
    AS MergeAction,
    CASE
      WHEN dim.CustomerID IS NULL THEN CURRENT_DATE()
    ELSE
    dim.StartDate
  END
    AS StartDate,
    CASE
      WHEN dim.CustomerID IS NULL THEN date'9999-12-31'
    ELSE
    CURRENT_DATE()-1
  END
    AS EndDate
  FROM
    `test.staging_customer` staging
  FULL JOIN (
    SELECT
      *
    FROM
      `test.dim_customer`
    WHERE
      EndDate=Date'9999-12-31') AS dim
  ON
    staging.CustomerNumber = dim.CustomerNumber
  UNION ALL
  SELECT
    dim.CustomerNumber,
    dim.CustomerID,
    dim.Name AS Name_Current,
    staging.Name AS Name_New,
    'INSERT' AS MergeAction,
    CURRENT_DATE() AS StartDate,
    DATE'9999-12-31' AS EndDate
  FROM
    `test.staging_customer` staging
  INNER JOIN
    `test.dim_customer` dim
  ON
    staging.CustomerNumber = dim.CustomerNumber
    AND dim.EndDate=Date'9999-12-31'
    AND dim.Name <> staging.Name ) AS S
ON
  T.CustomerID = S.CustomerID
  AND S.MergeAction IN ('UPDATE',
    'DELETE')
  WHEN MATCHED AND S.MergeAction = 'UPDATE' THEN UPDATE SET EndDate = S.EndDate, IsCurrent=FALSE
  WHEN MATCHED
  AND S.MergeAction = 'DELETE' THEN
UPDATE
SET
  EndDate = S.EndDate
  WHEN NOT MATCHED BY TARGET
  AND S.MergeAction='INSERT' THEN
INSERT
  (CustomerID,
    CustomerNumber,
    Name,
    IsCurrent,
    StartDate,
    EndDate)
VALUES
  (S.CustomerID, S.CustomerNumber, S.Name_New, TRUE, S.StartDate, S.EndDate) ;

After the statement is executed, the target table looks like the following screenshot:

2021031300937-image.png

infoNote: in many data warehousing implementations, people also adds a new attribute IsDeleted and insert a new record to indicate the period when the record is deleted from source. For simplicity, I'm not implementing that. 
infoNot matched by source clause is not used in the above example, as the subquery already ensured all records can be joined including deleted one. 

Summary 

This article provides an example about implementation SCD 2 dimensions using MERGE statement in BigQuery. There are several other approaches to implement SCD 2 in BigQuery with varied performance. I will explore them in other articles. 

More from Kontext
comment Comments
N Niko Gallo

Niko access_time 3 years ago link more_vert

hi Raymond,

thank you so much for your article, much appreciated!!!!

I made a change from your initial code: if you run the code in the same day with UPDATE record your code generate a duplicate, so:

  1. I have rename your initial 'DELETE' in 'CLOSE'
  2. i have insert a new transaction-type in the first select, in this case 'DELETE', with this condition: WHEN dim.StartDate= CURRENT_DATE() and staging.Name <> dim.Name THEN 'DELETE' (before the update transaction-type)
  3. and in the merge condition: WHEN MATCHED AND S.MergeAction = 'DELETE' THEN DELETE

otherwise it was perfect :)

ps.: sorry for my english!

Niko



Raymond Raymond

Raymond access_time 3 years ago link more_vert

Hi Niko,

I'm glad you find it useful.

The example I provided is for a daily batch process thus I have not thought about intra-day process. When you mention about running the code on the same day with updated records, do you mean you need to run the merge on the same day with different records in staging table?

N Niko Gallo

Niko access_time 3 years ago link more_vert

the run is daily, but if there is an error-robot and you have to re-run.

generally when I develop a job I try to make sure that it always does the same thing without generating errors

Raymond Raymond

Raymond access_time 3 years ago link more_vert

Thanks for the clarifications. It makes sense.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts