PySpark DataFrame - Expand or Explode Nested StructType

visibility 4,503 event 2022-07-23 access_time 11 months ago language English
more_vert

PySpark DataFrame is like a table in a relational databases. It has rows and columns. However there is one major difference is that Spark DataFrame (or Dataset) can have complex data types for columns. For example, StructType is a complex type that can be used to define a struct column which can include many fields.

Create a DataFrame with StructType

First, let's create a Spark DataFrame using the following script:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "PySpark Example - Explode StructType"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
        {"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]

customer_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
df_schema = StructType([StructField("id", IntegerType(), True), StructField(
    "customer_profile", customer_schema, False)])
df = spark.createDataFrame(data, df_schema)
print(df.schema)
df.show()

The script is very simple - it creates a list of records and then define a schema to be used to create DataFrame.

The output looks like the following:

+---+----------------+
| id|customer_profile|
+---+----------------+
|  1|    {Kontext, 3}|
|  2|      {Tech, 10}|
+---+----------------+

The DataFrame schema is defined as :

StructType([StructField('id', IntegerType(), True), StructField('customer_profile', StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True)]), False)])

Column customer_profile is defined as StructType.

Expand the StructType

Now we can directly expand the StructType column using [column_name].[attribute_name] syntax. The following code snippet shows you how to do that:

df.select('*', "customer_profile.name", "customer_profile.age").show()

The DataFrame will have two additional attributes as shown below:

+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+

We can also directly use [column_name].* to explode all attributes.

df.select('*', "customer_profile.*").show()

The result is the same as the previous one.

More from Kontext
copyright This page is subject to Site terms.
Like this article?
Share on
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts