Convert Pandas DataFrame to Spark DataFrame
Pandas DataFrame to Spark DataFrame
The following code snippet shows an example of converting a Pandas DataFrame to a Spark DataFrame:
import mysql.connector
import pandas as pd
from pyspark.sql import SparkSession

appName = "PySpark MySQL Example - via mysql.connector"
master = "local"
spark = SparkSession.builder.master(master).appName(appName).getOrCreate()

# Establish a connection
conn = mysql.connector.connect(user='hive', database='test_db',
                               password='hive', host="localhost", port=10101)
cursor = conn.cursor()
query = "SELECT id, value FROM test_table"

# Create a pandas dataframe
pdf = pd.read_sql(query, con=conn)
conn.close()

# Convert Pandas dataframe to Spark DataFrame
df = spark.createDataFrame(pdf)
df.show()
In this code snippet, the SparkSession.createDataFrame API is called to convert the Pandas DataFrame to a Spark DataFrame. This function also has an optional parameter named schema that can be used to specify the schema explicitly; if it is not provided, Spark infers the schema from the Pandas DataFrame.
Spark DataFrame to Pandas DataFrame
The following code snippet converts a Spark DataFrame to a Pandas DataFrame:
pdf = df.toPandas()
Note: this action causes all records in the Spark DataFrame to be sent to the driver application, which may cause performance issues or out-of-memory errors for large datasets.
Performance improvement
To improve performance, Apache Arrow can be enabled in Spark for these conversions. Refer to the following article for more details:
Improve PySpark Performance using Pandas UDF with Apache Arrow