PySpark DataFrame - Expand or Explode Nested StructType

Kontext Kontext event 2022-07-23 visibility 8,619
more_vert

PySpark DataFrame is like a table in a relational databases. It has rows and columns. However there is one major difference is that Spark DataFrame (or Dataset) can have complex data types for columns. For example, StructType is a complex type that can be used to define a struct column which can include many fields.

Create a DataFrame with StructType

First, let's create a Spark DataFrame using the following script:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "PySpark Example - Explode StructType"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
        {"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]

customer_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
df_schema = StructType([StructField("id", IntegerType(), True), StructField(
    "customer_profile", customer_schema, False)])
df = spark.createDataFrame(data, df_schema)
print(df.schema)
df.show()

The script is very simple - it creates a list of records and then define a schema to be used to create DataFrame.

The output looks like the following:

+---+----------------+
| id|customer_profile|
+---+----------------+
|  1|    {Kontext, 3}|
|  2|      {Tech, 10}|
+---+----------------+

The DataFrame schema is defined as :

StructType([StructField('id', IntegerType(), True), StructField('customer_profile', StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True)]), False)])

Column customer_profile is defined as StructType.

Expand the StructType

Now we can directly expand the StructType column using [column_name].[attribute_name] syntax. The following code snippet shows you how to do that:

df.select('*', "customer_profile.name", "customer_profile.age").show()

The DataFrame will have two additional attributes as shown below:

+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+

We can also directly use [column_name].* to explode all attributes.

df.select('*', "customer_profile.*").show()

The result is the same as the previous one.

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts