PySpark DataFrame - Expand or Explode Nested StructType
A PySpark DataFrame is like a table in a relational database: it has rows and columns. One major difference, however, is that a Spark DataFrame (or Dataset) column can have a complex data type. For example, StructType
is a complex type that can be used to define a struct column which can include many fields.
Create a DataFrame with StructType
First, let's create a Spark DataFrame using the following script:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "PySpark Example - Explode StructType"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
        {"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]

customer_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
df_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("customer_profile", customer_schema, False),
])

df = spark.createDataFrame(data, df_schema)
print(df.schema)
df.show()
The script is very simple - it creates a list of records and then defines a schema that is used to create the DataFrame.
The output looks like the following:
+---+----------------+
| id|customer_profile|
+---+----------------+
|  1|    {Kontext, 3}|
|  2|      {Tech, 10}|
+---+----------------+
The DataFrame schema is defined as:
StructType([
    StructField('id', IntegerType(), True),
    StructField('customer_profile', StructType([
        StructField('name', StringType(), True),
        StructField('age', IntegerType(), True)]), False)])
Column customer_profile
is defined as a StructType.
Expand the StructType
Now we can directly expand the StructType column using [column_name].[attribute_name]
syntax. The following code snippet shows you how to do that:
df.select('*', "customer_profile.name", "customer_profile.age").show()
The DataFrame now has two additional columns, as shown below:
+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+
We can also directly use [column_name].*
to explode all attributes.
df.select('*', "customer_profile.*").show()
The result is the same as the previous one.