
Load XML File into BigQuery

2021-05-16

Google Cloud BigQuery provides APIs that can be accessed by all mainstream programming languages. It also provides SDKs/packages that can be used directly in your applications to load XML files into BigQuery, regardless of whether the file is stored on Google Cloud Storage or in a temporary location that your program has access to.

Overview

This tutorial provides the steps to load data from a local XML file. If your file is stored in a GCS (Google Cloud Storage) bucket, you can use the GCS APIs to read it first, as sketched below. The tutorial uses Pandas to create a DataFrame from the XML file and then uses the pandas-gbq API to save the result to BigQuery.
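
If you do need to read the file from GCS, here is a minimal sketch using the google-cloud-storage client library (pip install google-cloud-storage); the bucket and object names are hypothetical placeholders, and download_as_text requires a reasonably recent release of the library (older ones use download_as_string):

from google.cloud import storage

# Download the XML content from a GCS bucket as a string.
# Bucket and object names below are placeholders for illustration.
client = storage.Client()
bucket = client.bucket("YOUR_BUCKET_NAME")
blob = bucket.blob("path/to/test.xml")
xml_content = blob.download_as_text()
print(xml_content)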

For simplicity, the Python script used in this article is run in the Cloud Shell (Google Cloud terminal) environment.

Initialize client credential

Follow these steps to set up the client credential.

  1. Follow the article Quickstart: Using client libraries  |  BigQuery  |  Google Cloud to configure a service account and download the client JSON credential file. You don't need to assign the Project Owner role to the service account; for this tutorial, read and write access to BigQuery is enough (bigquery.tables.create, bigquery.tables.updateData, bigquery.jobs.create). For simplicity (not best practice), I am adding the BigQuery Admin role to my service account.
  2. Upload the client credential file to Cloud Shell.
  3. Run the following command to edit bash profile:
    vi ~/.bashrc
  4. Add the following line to it:
    export GOOGLE_APPLICATION_CREDENTIALS="PATH_TO_THE_UPLOADED_CREDENTIAL_FILE.json"

    Remember to update the credential file path to your own. This matters because the Python client libraries read this environment variable by default, although we can also pass the credential file path explicitly (as the script below does).

  5. Activate the service account in Cloud Shell so that we can verify the access:
    gcloud auth activate-service-account service_account_name@project_name.iam.gserviceaccount.com --key-file PATH_TO_THE_UPLOADED_CREDENTIAL_FILE.json

    Remember to update the service account and key file to your own.

  6. Run the following bq command to test access to BigQuery (a Python-based check is sketched after this list):
    bq ls 
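
Alternatively, you can verify access from Python with the google-cloud-bigquery client library (pip install google-cloud-bigquery). A minimal sketch, assuming GOOGLE_APPLICATION_CREDENTIALS is set as above:

from google.cloud import bigquery

# List the datasets in the default project to confirm that the
# service account credentials work; roughly equivalent to `bq ls`.
client = bigquery.Client()
for dataset in client.list_datasets():
    print(dataset.dataset_id)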

Add an XML file in Cloud Shell

In Cloud Shell, create an XML file named test.xml with the following content:

<?xml version="1.0" encoding="utf-8"?>
<records>
	<rec id="1" value="A"/>
	<rec id="2" value="B"/>
	<rec id="3" value="C"/>
	<rec id="4" value="D"/>
</records>


Install required packages

Run the following commands to install the required packages. Refer to Pandas - Save DataFrame to BigQuery - Kontext to learn more about the to_gbq API.

pip install pandas
# Or
pip3 install pandas
pip install pandas-gbq
# Or
pip3 install pandas-gbq
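
Optionally, you can confirm that both packages are importable and check their versions with a quick one-liner:

python3 -c "import pandas, pandas_gbq; print(pandas.__version__, pandas_gbq.__version__)"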

Create a Python script to load the file

Create a Python script file (xml-to-bq.py) with the following content in Cloud Shell:

import pandas as pd
from google.oauth2.service_account import Credentials
from xml.dom.minidom import parse

# Load XML file.
data_records = []
with parse('test.xml') as xml_doc:
    root = xml_doc.documentElement
    records = root.getElementsByTagName('rec')

    for r in records:
        data = {}
        data['id'] = r.getAttributeNode('id').value
        data['value'] = r.getAttributeNode('value').value
        data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)

# Define target table in BQ
target_table = "YOUR_DATA_SET.xml_test"
project_id = "YOUR_PROJECT_ID"
credential_file = "PATH_TO_YOUR_SERVICE_ACCOUNT_CREDENTIAL_FILE.json"
credential = Credentials.from_service_account_file(credential_file)
# Location for the BQ job; it must match the destination table's location
job_location = "australia-southeast1"

# Save Pandas dataframe to BQ
df.to_gbq(target_table, project_id=project_id, if_exists='replace',
          location=job_location, progress_bar=True, credentials=credential)

Please remember to update the following parts based on your environment setup:

  • the BigQuery destination table name;
  • the project ID, credential file path, and job location;
  • the if_exists parameter value: the snippet above overwrites an existing table; use 'append' to add rows instead, or 'fail' to raise an error if the table already exists.
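
As a side note, if your environment has pandas 1.3 or later, the minidom parsing can likely be replaced with the built-in pandas.read_xml (which uses the lxml package by default). A minimal sketch, assuming the same test.xml layout:

import pandas as pd

# read_xml (pandas >= 1.3) turns each <rec> element's attributes
# into DataFrame columns directly.
df = pd.read_xml('test.xml', xpath='//rec')
print(df)

Note that read_xml infers column types (id becomes an integer), whereas the minidom version above keeps attribute values as strings.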

Run the script

You can then run the script in Cloud Shell using the following command:

python xml-to-bq.py
# Or
python3 xml-to-bq.py

Verify the result in BigQuery

Go to your BigQuery project and you will find that a new table named xml_test has been created successfully.
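
You can also verify the load from Python with the google-cloud-bigquery client library. A minimal sketch, assuming the same project, dataset, location, and credential placeholders as above:

from google.oauth2.service_account import Credentials
from google.cloud import bigquery

# Query the newly loaded table and print each row.
credential = Credentials.from_service_account_file(
    "PATH_TO_YOUR_SERVICE_ACCOUNT_CREDENTIAL_FILE.json")
client = bigquery.Client(project="YOUR_PROJECT_ID", credentials=credential)
query = "SELECT id, value FROM YOUR_DATA_SET.xml_test ORDER BY id"
for row in client.query(query, location="australia-southeast1").result():
    print(row.id, row.value)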


References

Pandas - Save DataFrame to BigQuery - Kontext

Read and Write XML Files with Python - Kontext
