When creating a Spark data frame with a schema, you may encounter errors such as “field **: **Type can not accept object ** in type <class '*'>”.
The exact message varies. For instance, the following are some examples:
- field xxx: BooleanType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'str'>
- field xxx: IntegerType can not accept object 'String value' in type <class 'str'>
- …
The error is self-explanatory: a field in your DataFrame’s schema is declared with a type that does not match the Python type of the actual data.
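To locate such mismatches before handing the data to Spark, a check along the following lines can help. This is a minimal sketch in plain Python, not PySpark's own verification code: the helper `find_type_mismatches`, the table `ACCEPTED_PYTHON_TYPES`, and the sample records are hypothetical names introduced for illustration, and the table covers only the types mentioned in the messages above. It assumes an exact type comparison, which is why a `bool` is not accepted for an integer field here.

```python
from decimal import Decimal

# Hypothetical lookup table (an assumption for illustration, not PySpark's own):
# schema type name -> Python types accepted for that field.
ACCEPTED_PYTHON_TYPES = {
    "StringType": (str,),
    "IntegerType": (int,),
    "BooleanType": (bool,),
    "DecimalType": (Decimal,),
}


def find_type_mismatches(rows, fields):
    """Return (row_index, field_name, value, actual_type_name) for each bad value.

    rows   -- list of tuples, one tuple per record
    fields -- list of (field_name, schema_type_name) pairs in column order
    """
    mismatches = []
    for row_index, row in enumerate(rows):
        for (name, type_name), value in zip(fields, row):
            if value is None:
                continue  # this sketch assumes every field is nullable
            # Exact type comparison: subclasses (e.g. bool for int) are rejected
            if type(value) not in ACCEPTED_PYTHON_TYPES[type_name]:
                mismatches.append((row_index, name, value, type(value).__name__))
    return mismatches


# Hypothetical sample records and field list
sample_rows = [("A", 100, True), ("B", Decimal(120), 1)]
sample_fields = [("Name", "StringType"),
                 ("Amount", "DecimalType"),
                 ("Active", "BooleanType")]

for row_index, name, value, actual in find_type_mismatches(sample_rows, sample_fields):
    print(f"row {row_index}, field {name}: {value!r} has type {actual}")
```

Each reported entry corresponds to one value that would need either a schema change or a data conversion.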
One specific example
Let’s look at the following example:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType
appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]
# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])
# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)
# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
The second field is defined as a decimal in the schema, while the values in the data list are actually integers.
As a result, the following error is thrown:
TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>
To fix it, we have at least two options.
Option 1 - change the definition of the schema
Since the values in the data are integers, we can change the schema definition to the following:
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
Option 2 - change the data type of the values in the RDD
from decimal import Decimal
…
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
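When the list is long or comes from another source, wrapping every value by hand is impractical. The conversion can be sketched as a small helper instead; `to_decimal_column` is a hypothetical name introduced here, not part of PySpark, and the sketch assumes each record is a tuple and that the column to convert is identified by its position.

```python
from decimal import Decimal


def to_decimal_column(rows, column_index):
    """Return new tuples with the value at column_index converted to Decimal."""
    converted = []
    for row in rows:
        values = list(row)
        value = values[column_index]
        if value is not None:
            # Convert via str() so a float such as 0.1 keeps its printed value
            values[column_index] = Decimal(str(value))
        converted.append(tuple(values))
    return converted


data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Column 1 is the 'Count' field in the schema
data = to_decimal_column(data, 1)
print(data[0])
```

The converted list can then be parallelized and passed to createDataFrame with the original DecimalType schema. Missing values (None) are left untouched so that nullable fields keep working.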
Summary
The exact error you get may differ from the one above; however, it can usually be fixed with one of the same two approaches: change the schema to match the data, or change the data to match the schema. Leave a comment if you have a question or an issue you cannot fix.