Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>
When creating a Spark DataFrame with an explicit schema, you may encounter errors of the form “field **: **Type can not accept object ** in type <class '*'>”.
The exact message varies; the following are some examples:
- field xxx: BooleanType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'int'>
- field xxx: DecimalType can not accept object 100 in type <class 'str'>
- field xxx: IntegerType can not accept object ‘String value’ in type <class 'str'>
- …
The error is self-explanatory - a field in your DataFrame’s schema is declared with a type that does not match the Python type of the value supplied for that field.
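Before handing data to Spark, it can help to locate the offending values in plain Python. The following is a minimal sketch, not Spark's own validation: the `ACCEPTED` mapping, the `find_mismatches` helper, and the field names are hypothetical, and the mapping covers only the types mentioned in the examples above.

```python
from decimal import Decimal

# Assumed mapping from Spark type names to the Python types they accept.
# Only the types that appear in the example messages are listed.
ACCEPTED = {
    "BooleanType": (bool,),
    "IntegerType": (int,),
    "DecimalType": (Decimal,),
    "StringType": (str,),
}


def find_mismatches(rows, fields):
    """Return (row_index, field_name, spark_type, value) for values that do not fit."""
    problems = []
    for i, row in enumerate(rows):
        for value, (name, spark_type) in zip(row, fields):
            if value is None:
                continue  # assumption: the field is nullable
            # Exact type match on purpose: bool is a subclass of int in Python,
            # so isinstance() would let True pass as an integer.
            if type(value) not in ACCEPTED[spark_type]:
                problems.append((i, name, spark_type, value))
    return problems


# Hypothetical fields mirroring the first two example messages.
fields = [("is_active", "BooleanType"), ("amount", "DecimalType")]
rows = [(100, 100)]

problems = find_mismatches(rows, fields)
for row_index, name, spark_type, value in problems:
    print(f"row {row_index}: field {name}: {spark_type} got {value!r} ({type(value).__name__})")
```

Each reported tuple points at a row and field whose value would need converting, or whose schema type would need changing.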
One specific example
Let’s look at the following example:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
The second field is declared as decimal in the schema, but the values in the data list are actually integers.
Thus the following error is raised:
TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>
To fix it, we have at least two options.
Option 1 - change the definition of the schema
Since the values in the data are integers, we can change the schema definition to the following:
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
Option 2 - change the data type of RDD
from decimal import Decimal
…
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
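When the list is long or comes from another source, rewriting each literal by hand is impractical. The following is a minimal sketch of converting the values programmatically, assuming the tuple layout from the example (the count at index 1); the helper name `to_decimal_rows` is hypothetical.

```python
from decimal import Decimal


def to_decimal_rows(rows, index=1):
    """Return new tuples with the value at `index` converted to Decimal."""
    converted = []
    for row in rows:
        values = list(row)
        if values[index] is not None:
            # Convert via str() so that a float such as 0.1 becomes Decimal('0.1')
            # instead of its full binary expansion.
            values[index] = Decimal(str(values[index]))
        converted.append(tuple(values))
    return converted


data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

decimal_data = to_decimal_rows(data)
print(decimal_data[0])
```

The converted list can then be passed to spark.sparkContext.parallelize and spark.createDataFrame with the original DecimalType schema.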
Summary
The exact error you get may differ, but the fix usually follows one of the two options above: either change the schema to match the data, or convert the data to match the schema. Leave a comment if you have a question or an issue you cannot fix.
Stupid, 3 years ago
My error is "TypeError: StructType can not accept object '***' in type <class 'str'>"
so confused
If you can share a code sample along with some sample data, I may be able to help you fix the problem.