
Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>

Raymond Tang, about 4 months ago

When creating a Spark data frame with an explicit schema, you may encounter errors of the form “field **: **Type can not accept object ** in type <class '*'>”.

The exact message varies. For instance, the following are some examples:

  • field xxx: BooleanType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'str'>
  • field xxx: IntegerType can not accept object ‘String value’ in type <class 'str'>

The error is self-explanatory: a field in your DataFrame’s schema is declared with a type that does not accept the Python type of the value supplied for it.

One specific example

Let’s look at the following example:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType, DecimalType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()

The second field, Count, is declared as DecimalType in the schema, but the values in the data list are Python integers.

As a result, the following error is raised:

TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>
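
With larger data sets it can help to locate the offending values before handing the data to Spark. The following is a minimal sketch in plain Python, not part of PySpark: the helper `find_mismatches` and the `expected` mapping are hypothetical names introduced here, and the sketch assumes that a DecimalType field only accepts `decimal.Decimal` values (or None when the field is nullable), as the error above indicates.

```python
from decimal import Decimal

data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Assumption for this sketch: tuple position -> (field name, Python type
# that the declared Spark type accepts). Count is declared as DecimalType.
expected = {1: ('Count', Decimal)}


def find_mismatches(rows, expected):
    """Return (row index, field name, actual type name) for each mismatch."""
    problems = []
    for row_index, row in enumerate(rows):
        for position, (name, py_type) in expected.items():
            value = row[position]
            # None is allowed because the fields are declared nullable
            if value is not None and not isinstance(value, py_type):
                problems.append((row_index, name, type(value).__name__))
    return problems


mismatches = find_mismatches(data, expected)
print(mismatches)
```

Every row is reported here because each Count value is an int rather than a Decimal.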

To fix it, we have at least two options.

Option 1 - change the definition of the schema

Since the values in the data are integers, we can declare the Count field as IntegerType instead:

schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

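Option 2 - change the data type of the values in the RDD

Alternatively, keep the DecimalType schema and create the Count values as decimal.Decimal objects: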

from decimal import Decimal

data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
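
When the list is too long to edit by hand, or the data comes from another source, the same conversion can be applied programmatically. The following is a minimal sketch in plain Python: the helper name `to_decimal_count` is hypothetical, and it assumes Count is always the second element of each tuple and holds an integer.

```python
from decimal import Decimal

data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]


def to_decimal_count(row):
    """Return a copy of the row with the Count value converted to Decimal."""
    category, count, description = row
    return (category, Decimal(count), description)


# Convert every row so that Count matches a DecimalType field
converted = [to_decimal_count(row) for row in data]
print(converted[0])
```

If the data is already in an RDD, the same function could be applied with `rdd.map(to_decimal_count)` before calling `spark.createDataFrame`.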

Summary

The exact error message you get may differ, but the fix follows the same two approaches: either change the schema to match the data, or convert the data to match the schema. Leave a comment if you have a question or an issue you cannot fix.
