Pandas - Save DataFrame to BigQuery

event 2021-03-17 visibility 7,935 comment 0 insights
more_vert
insights Stats
Pandas - Save DataFrame to BigQuery
Raymond Raymond Google Cloud Platform

Data analytics with Google Cloud Platform (GCP). For BigQuery SQL mentioned in the articles, they are all using standard SQL dialect unless specified differently. 

Many Python data analysts or engineers use Pandas to analyze data. It might be a common requirement to persist the transformed and calculated data to BigQuery once the analysis is done. This article expands on the previous article Load JSON File into BigQuery to provide one approach to save data frame to BigQuery with Python.

Prerequisites

Similar as Load JSON File into BigQuery, we need to use a credential to run BigQuery job to load data into it. The credential usually is generated from a service account with proper permissions/roles setup. Refer to that article about the details of setup credential file.

infoThere are different ways to instantiate a credential object in Python. For example, you can use credential JSON file or use OAuth. In this tutorial, credential JSON file will be used.

You will need the following ready to continue on this tutorial:

  • Google cloud service account credential file which has access to load data into BigQuery. At lease these permissions are required: bigquery.tables.create, bigquery.tables.updateData, bigquery.jobs.create.
  • Cloud Shell or other OS where you can access Google APIs.
  • Python with pandas and pandas-gbq package installed. 

If pandas package is not installed, please use the following command to install:

pip install pandas
# Or
pip3 install pandas
pip install pandas-gbq
# Or
pip3 install pandas-gbq

About to_gbq function

This tutorial directly use pandas DataFrame's to_gbq function to write into Google Cloud BigQuery. Refer to the API documentation for more details about this function: pandas.DataFrame.to_gbq — pandas 1.2.3 documentation (pydata.org).

The signature of the function looks like the following:

DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)

Create a python script file

We start to create a python script file named pd-to-bq.py with the following content:

import pandas as pd
from datetime import datetime
from google.oauth2.service_account import Credentials

# Construct a DataFrame


def str_to_date(str):
    return datetime.strptime(str, '%Y-%m-%d').date()


data = [{'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeA', 'SALES': 1000},
        {'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeB', 'SALES': 200},
        {'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeC', 'SALES': 300},
        {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeA', 'SALES': 700},
        {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeB', 'SALES': 400},
        {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeC', 'SALES': 500},
        {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeA', 'SALES': 300},
        {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeB', 'SALES': 900},
        {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeC', 'SALES': 100}
        ]
df = pd.DataFrame(data)

# Define target table in BQ
target_table = "YOUR_DATA_SET.pandas"
project_id = "YOUR_PROJECT_ID"
credential_file = "PATH_TO_YOUR_SERVICE_ACCOUNT_CREDENTIAL_FILE.json"
credential = Credentials.from_service_account_file(credential_file)
# Location for BQ job, it needs to match with destination table location
job_location = "australia-southeast1"

# Save Pandas dataframe to BQ
df.to_gbq(target_table, project_id=project_id, if_exists='replace',
          location=job_location, progress_bar=True, credentials=credential)

About the script file

The script file does the following actions:

  • Construct a pandas DataFrame object in memory (from Pandas DataFrame Plot - Bar Chart). You can replace it with whichever way you feel comfortable to create a DataFrame. 
  • Then it defines a number of variables about target table in BigQuery, project ID, credentials and location to run the BigQuery data load job. Remember to replace these values accordingly. 
  • Finally it saves the results to BigQuery. 
warning Warning: the above script replaces the table if it exists. You can change values of parameter if_exists accordingly.

Once the script is run, the table will be created. Navigate to BigQuery, the preview of the newly created table looks like the following screenshot:

2021031794639-image.png

Summary

It is very easy to save DataFrame to BigQuery using pandas built-in function. If you run the script in Google compute engine, you can also use google.auth.compute_engine.Credentials object.

Let me know if you encounter any problems. 

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts