Load XML File into BigQuery
Google Cloud BigQuery provides APIs that can be accessed by all the mainstream programming languages. It also provides SDKs/packages that can be directly accessed in your applications to load XML file into BigQuery, regardless of whether the file is stored on Google Cloud Storage or in a temporary location that your program has access to.
Overview
This tutorial provides steps to load data from a local XML file. You can use GCS (Google Cloud Storage) APIs to read XML file stored in GCS bucket. This tutorial uses Pandas to create data frame from XML file and then use Pandas API to save the result to BigQuery.
Initialize client credential
Follow these steps to setup client credential.
- Follow article Quickstart: Using client libraries | BigQuery | Google Cloud to configure a service account and download client JSON credential file. You don't necessarily assign project owner to the service account. For this tutorial, you only need to assign read and write access to BigQuery (bigquery.tables.create, bigquery.tables.updateData, bigquery.jobs.create). For simplicity (not best practice), I am adding BigQuery Admin to my service account.
- Upload the client credential file to Cloud Shell.
- Run the following command to edit bash profile:
vi ~/.bashrc
- Add the following line to it:
export GOOGLE_APPLICATION_CREDENTIALS="PATH_TO_THE_UPLOADED_CREDENTIAL_FILE.json"
Remember to update the credential file path to your own. This is important as the credential file will be used directly by the Python client functions though we can also explicitly specify the credential file path.
- Activate the service account in Cloud Shell so that we can verify the access:
gcloud auth activate-service-account service_account_name@project_name.iam.gserviceaccount.com --key-file PATH_TO_THE_UPLOADED_CREDENTIAL_FILE.json
Remember to update the service account and key file to your own ones.
- Run the following bq command to test access to BigQuery:
bq ls
Add an XML file in cloud shell
In Cloud Shell, create an XML file named test.xml with the following content:
<?xml version="1.0" encoding="utf-8"?> <records> <rec id="1" value="A"/> <rec id="2" value="B"/> <rec id="3" value="C"/> <rec id="4" value="D"/> </records>
Install required packages
Run the following package to install required packages. Refer to Pandas - Save DataFrame to BigQuery - Kontext to learn more about to_gbq API.
pip install pandas # Or pip3 install pandas
pip install pandas-gbq # Or pip3 install pandas-gbq
Create a Python script to load the file
Create a Python script file (xml-to-bq.py) with the following content in Cloud Shell:
import pandas as pd from datetime import datetime from google.oauth2.service_account import Credentials from xml.dom.minidom import parse # Load XML file. data_records = [] with parse('test.xml') as xml_doc: root = xml_doc.documentElement records = root.getElementsByTagName('rec') for r in records: data = {} data['id'] = r.getAttributeNode('id').value data['value'] = r.getAttributeNode('value').value data_records.append(data) print(data_records) df = pd.DataFrame(data_records) # Define target table in BQ target_table = "YOUR_DATA_SET.xml_test" project_id = "YOUR_PROJECT_ID" credential_file = "PATH_TO_YOUR_SERVICE_ACCOUNT_CREDENTIAL_FILE.json" credential = Credentials.from_service_account_file(credential_file) # Location for BQ job, it needs to match with destination table location job_location = "australia-southeast1" # Save Pandas dataframe to BQ df.to_gbq(target_table, project_id=project_id, if_exists='replace', location=job_location, progress_bar=True, credentials=credential)
Please remember to update the following parts accordingly based on your environment setup:
- Replace BigQuery destination table name.
- Replace project ID, credential file path and job location
- if_exists parameter value. The above code snippet will overwrite existing table.
Run the script
You can then run the script in Cloud Shell using the following command:
python xml-to-bq.py
Verify the result in BigQuery
Go to your BigQuery project and you will find a new table named xml_test is created successfully.
References
Pandas - Save DataFrame to BigQuery - Kontext
Read and Write XML Files with Python - Kontext