Introduction to PySpark StructType and StructField

In Spark SQL, StructType defines a struct data type that contains a list of StructField objects. Each StructField has a name and a data type, which can be any DataType. A common use is to define a DataFrame's schema; another is to define the data type returned by a UDF.

About DataType in Spark

The following table lists the supported data types in Spark.

Data type	Value type in Scala	API to access or create a data type
ByteType	Byte	ByteType
ShortType	Short	ShortType
IntegerType	Int	IntegerType
LongType	Long	LongType
FloatType	Float	FloatType
DoubleType	Double	DoubleType
DecimalType	java.math.BigDecimal	DecimalType
StringType	String	StringType
BinaryType	Array[Byte]	BinaryType
BooleanType	Boolean	BooleanType
TimestampType	java.sql.Timestamp	TimestampType
DateType	java.sql.Date	DateType
YearMonthIntervalType	java.time.Period	YearMonthIntervalType
DayTimeIntervalType	java.time.Duration	DayTimeIntervalType
ArrayType	scala.collection.Seq	ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true.
MapType	scala.collection.Map	MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true.
StructType	org.apache.spark.sql.Row	StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed.
StructField	The value type in Scala of the data type of this field (for example, Int for a StructField with the data type IntegerType)	StructField(name, dataType, [nullable]) Note: The default value of nullable is true.

*Cited from Data Types - Spark 3.3.0 Documentation.

Use StructType and StructField to define schema

The following code snippet uses StructType and StructField to define the schema for a DataFrame.

Info - Spark can infer the schema from most data sources. An explicit schema definition can be used to ensure that the input data matches your target schema.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

app_name = "PySpark StructType and StructField Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [['Hello Kontext!', 100], ['Hello Context!', 100]]

# Define the schema for the input data
schema = StructType([StructField('str_col', StringType(), nullable=True),
                     StructField('int_col', IntegerType(), nullable=True)])

# Create a DataFrame with the schema provided
df = spark.createDataFrame(data=data, schema=schema)

print(df.schema)

df.show()

Running the above PySpark script produces output like the following:

StructType([StructField('str_col', StringType(), True), StructField('int_col', IntegerType(), True)])

+--------------+-------+
|       str_col|int_col|
+--------------+-------+
|Hello Kontext!|    100|
|Hello Context!|    100|
+--------------+-------+

One thing to know is that a StructField can also use StructType itself as its data type. This is referred to as a nested struct type. Refer to PySpark DataFrame - Expand or Explode Nested StructType for some examples.

Use StructType and StructField in UDF

When creating user-defined functions (UDFs) in Spark, we can explicitly specify the returned data type by passing it to the @udf or @pandas_udf decorator. If no return type is given, @udf defaults to StringType.

The following code snippet provides an example of an explicit return type for a UDF.

from pyspark.sql.functions import udf

@udf(IntegerType())
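def custom_udf(value):
    # Return the string length; pass nulls through because str_col is nullable
    return len(value) if value is not None else None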

df = df.withColumn('str_len', custom_udf(df.str_col))

df.show()

Output:

+--------------+-------+-------+
|       str_col|int_col|str_len|
+--------------+-------+-------+
|Hello Kontext!|    100|     14|
|Hello Context!|    100|     14|
+--------------+-------+-------+
