Pandas - Save DataFrame to BigQuery
insights Stats
Data analytics with Google Cloud Platform (GCP). For BigQuery SQL mentioned in the articles, they are all using standard SQL dialect unless specified differently.
Many Python data analysts or engineers use Pandas to analyze data. It might be a common requirement to persist the transformed and calculated data to BigQuery once the analysis is done. This article expands on the previous article Load JSON File into BigQuery to provide one approach to save data frame to BigQuery with Python.
Prerequisites
Similar as Load JSON File into BigQuery, we need to use a credential to run BigQuery job to load data into it. The credential usually is generated from a service account with proper permissions/roles setup. Refer to that article about the details of setup credential file.
You will need the following ready to continue on this tutorial:
- Google cloud service account credential file which has access to load data into BigQuery. At lease these permissions are required: bigquery.tables.create, bigquery.tables.updateData, bigquery.jobs.create.
- Cloud Shell or other OS where you can access Google APIs.
- Python with pandas and pandas-gbq package installed.
If pandas package is not installed, please use the following command to install:
pip install pandas # Or pip3 install pandas
pip install pandas-gbq # Or pip3 install pandas-gbq
About to_gbq function
This tutorial directly use pandas DataFrame's to_gbq function to write into Google Cloud BigQuery. Refer to the API documentation for more details about this function: pandas.DataFrame.to_gbq — pandas 1.2.3 documentation (pydata.org).
The signature of the function looks like the following:
DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)
Create a python script file
We start to create a python script file named pd-to-bq.py with the following content:
import pandas as pd from datetime import datetime from google.oauth2.service_account import Credentials # Construct a DataFrame def str_to_date(str): return datetime.strptime(str, '%Y-%m-%d').date() data = [{'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeA', 'SALES': 1000}, {'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeB', 'SALES': 200}, {'DATE': str_to_date('2020-01-01'), 'TYPE': 'TypeC', 'SALES': 300}, {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeA', 'SALES': 700}, {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeB', 'SALES': 400}, {'DATE': str_to_date('2020-02-01'), 'TYPE': 'TypeC', 'SALES': 500}, {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeA', 'SALES': 300}, {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeB', 'SALES': 900}, {'DATE': str_to_date('2020-03-01'), 'TYPE': 'TypeC', 'SALES': 100} ] df = pd.DataFrame(data) # Define target table in BQ target_table = "YOUR_DATA_SET.pandas" project_id = "YOUR_PROJECT_ID" credential_file = "PATH_TO_YOUR_SERVICE_ACCOUNT_CREDENTIAL_FILE.json" credential = Credentials.from_service_account_file(credential_file) # Location for BQ job, it needs to match with destination table location job_location = "australia-southeast1" # Save Pandas dataframe to BQ df.to_gbq(target_table, project_id=project_id, if_exists='replace', location=job_location, progress_bar=True, credentials=credential)
About the script file
The script file does the following actions:
- Construct a pandas DataFrame object in memory (from Pandas DataFrame Plot - Bar Chart). You can replace it with whichever way you feel comfortable to create a DataFrame.
- Then it defines a number of variables about target table in BigQuery, project ID, credentials and location to run the BigQuery data load job. Remember to replace these values accordingly.
- Finally it saves the results to BigQuery.
Once the script is run, the table will be created. Navigate to BigQuery, the preview of the newly created table looks like the following screenshot:
Summary
It is very easy to save DataFrame to BigQuery using pandas built-in function. If you run the script in Google compute engine, you can also use google.auth.compute_engine.Credentials object.
Let me know if you encounter any problems.