Introduction to PySpark StructType and StructField
In Spark SQL, StructType can be used to define a struct data type that includes a list of StructField objects. Each StructField has a name and a data type, which can be any DataType. One common use is to define a DataFrame's schema; another is to define the return type of a UDF.
About DataType in Spark
The following table lists all the supported data types in Spark.
Data type | Value type in Scala | API to access or create a data type |
---|---|---|
ByteType | Byte | ByteType |
ShortType | Short | ShortType |
IntegerType | Int | IntegerType |
LongType | Long | LongType |
FloatType | Float | FloatType |
DoubleType | Double | DoubleType |
DecimalType | java.math.BigDecimal | DecimalType |
StringType | String | StringType |
BinaryType | Array[Byte] | BinaryType |
BooleanType | Boolean | BooleanType |
TimestampType | java.sql.Timestamp | TimestampType |
DateType | java.sql.Date | DateType |
YearMonthIntervalType | java.time.Period | YearMonthIntervalType |
DayTimeIntervalType | java.time.Duration | DayTimeIntervalType |
ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true. |
MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true. |
StructType | org.apache.spark.sql.Row | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
StructField | The value type in Scala of the data type of this field(For example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is true. |
*Cited from Data Types - Spark 3.3.0 Documentation.
Use StructType and StructField to define schema
The following code snippet uses StructType and StructField to define the schema for a DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

app_name = "PySpark StructType and StructField Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [['Hello Kontext!', 100], ['Hello Context!', 100]]

# Define the schema for the input data
schema = StructType([StructField('str_col', StringType(), nullable=True),
                     StructField('int_col', IntegerType(), nullable=True)])

# Create a DataFrame with the schema provided
df = spark.createDataFrame(data=data, schema=schema)

print(df.schema)
df.show()
Running the above PySpark script produces output like the following:
StructType([StructField('str_col', StringType(), True), StructField('int_col', IntegerType(), True)])
+--------------+-------+
|       str_col|int_col|
+--------------+-------+
|Hello Kontext!|    100|
|Hello Context!|    100|
+--------------+-------+
One thing to know is that a StructField can also use StructType itself as its data type. This is referred to as a nested struct type. Refer to PySpark DataFrame - Expand or Explode Nested StructType for some examples.
Use StructType and StructField in UDF
When creating user-defined functions (UDFs) in Spark, we can explicitly specify the return data type by passing it to the @udf or @pandas_udf decorator. If no return type is passed to @udf, it defaults to StringType rather than being inferred from the function body.
The following code snippet shows a UDF with an explicit return type.
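from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Declare the return type explicitly as IntegerType
@udf(IntegerType())
def custom_udf(value):
    # str_col is nullable, so return None instead of calling len() on None
    return len(value) if value is not None else None

# Add a column holding the length of each string in str_col
df = df.withColumn('str_len', custom_udf(df.str_col))
df.show()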
Output:
+--------------+-------+-------+
|       str_col|int_col|str_len|
+--------------+-------+-------+
|Hello Kontext!|    100|     14|
|Hello Context!|    100|     14|
+--------------+-------+-------+