When creating a Spark DataFrame with an explicit schema, you may encounter errors of the form "field **: **Type can not accept object ** in type <class '*'>".
The exact message varies. The following are some examples:
- field xxx: BooleanType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'str'>
- field xxx: IntegerType can not accept object "String value" in type <class 'str'>
- …
The error is self-explanatory: a field in your DataFrame's schema is declared with a type that does not match the Python type of the actual data.
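To locate the offending values before handing the data to Spark, a plain-Python pre-check can help. The following is a minimal sketch, not Spark's own verification logic: the `ACCEPTED` mapping, the `find_mismatches` helper, and the sample field names are all hypothetical, and the sketch assumes that each value must have exactly one of the listed Python types (so an `int` does not count as a `bool` or a `Decimal`).

```python
from decimal import Decimal

# Hypothetical mapping from a schema type name to the Python types
# this sketch assumes are acceptable for it.
ACCEPTED = {
    "StringType": (str,),
    "IntegerType": (int,),
    "BooleanType": (bool,),
    "DecimalType": (Decimal,),
}


def find_mismatches(rows, fields):
    """Return (row_index, field_name, type_name, value) for every value
    whose exact Python type is not in the accepted list for its field."""
    problems = []
    for i, row in enumerate(rows):
        for (name, type_name), value in zip(fields, row):
            if value is None:
                continue  # assumption: all fields are nullable
            if type(value) not in ACCEPTED[type_name]:
                problems.append((i, name, type_name, value))
    return problems


# Illustrative field definitions and rows (not taken from a real dataset).
fields = [("name", "StringType"), ("active", "BooleanType"), ("amount", "DecimalType")]
rows = [
    ("a", True, Decimal("1.5")),
    ("b", 100, 100),  # both 100 values have the wrong Python type
]

mismatches = find_mismatches(rows, fields)
for i, name, type_name, value in mismatches:
    print(f"row {i}, field {name}: {type_name} got {value!r} of type {type(value).__name__}")
```

Running a check like this on a sample of the data shows which row and field to fix, which is often quicker than reading the message from a failed Spark job.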
One specific example
Let's look at the following example:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
The second field, Count, is defined as a decimal in the schema, but the values in the data list are actually Python integers.
As a result, the following error is raised:
TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>
To fix it, we have at least two options.
Option 1 - change the definition of the schema
Since the values in the data are integers, we can change the schema definition to use IntegerType:
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
Option 2 - change the data type of RDD
from decimal import Decimal
# …
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
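When the data comes from elsewhere and cannot be rewritten by hand as in Option 2, the same conversion can be applied programmatically before the RDD is created. The following is a small sketch in plain Python; the `with_decimal_column` helper is hypothetical, and it assumes each row is a tuple with the Count value at position 1.

```python
from decimal import Decimal


def with_decimal_column(rows, index):
    """Return new tuples in which the value at `index` is a Decimal."""
    converted = []
    for row in rows:
        values = list(row)
        if values[index] is not None:
            # Going through str() avoids binary floating-point artifacts
            # if a float value appears in the column.
            values[index] = Decimal(str(values[index]))
        converted.append(tuple(values))
    return converted


data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Convert the second element (Count) of every row to Decimal.
decimal_data = with_decimal_column(data, 1)
print(decimal_data[0])
```

The converted list can then be passed to spark.sparkContext.parallelize in place of the original data, with the schema that declares Count as DecimalType left unchanged.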
Summary
The exact error you get may differ from the one above, but the fix usually follows one of these two options: either change the schema to match the data, or convert the data to match the schema. Leave a comment if you have a question or run into an issue you cannot fix.