Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>

Raymond · 2019-07-10
When creating a Spark data frame with an explicit schema, you may encounter errors like "field **: **Type can not accept object ** in type <class '*'>".

The actual error can vary; for instance, the following are some examples:

  • field xxx: BooleanType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'str'>
  • field xxx: IntegerType can not accept object 'String value' in type <class 'str'>

The error is self-describing: a field in your DataFrame's schema is defined as a type that differs from the actual data type.

One specific example

Let’s look at the following example:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()

The second field is defined as DecimalType, while the values in the data list are actually Python integers.

Thus the following error is thrown:

TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>

To fix it, we have at least two options.

Option 1 - change the definition of the schema

Since the data holds integers, we can change the schema definition to the following:

schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

Option 2 - change the data type of RDD

from decimal import Decimal

data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
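If the list is long or built elsewhere, you can also convert the existing rows instead of retyping them. A small sketch, assuming the same three-column layout as above:

```python
from decimal import Decimal

data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Convert the Count column of every row to Decimal so it matches DecimalType
data = [(category, Decimal(count), description)
        for category, count, description in data]
```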

Summary

The exact error you get may be very different; however, the approach to fix it is similar to the two options above. Leave a comment if you have a question or an issue you cannot fix.

Comments
Stupid Human #1550 · 3 years ago

My error is "TypeError: StructType can not accept object '***' in type <class 'str'>"

so confused

Raymond #1551 · 3 years ago

If you can share some code sample and also the sample data, I might be able to help you fix your problem.
