When creating a Spark DataFrame with an explicit schema, you may encounter errors like “field **: **Type can not accept object ** in type <class '*'>”.

The exact error message varies; here are some examples:

  • field xxx: BooleanType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'int'>
  • field xxx: DecimalType can not accept object 100 in type <class 'str'>
  • field xxx: IntegerType can not accept object 'String value' in type <class 'str'>

The error is self-describing: a field in your DataFrame’s schema is declared with a type that does not match the actual type of the data.

One specific example

Let’s look at the following example:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [('Category A', 100, "This is category A"),
        ('Category B', 120, "This is category B"),
        ('Category C', 150, "This is category C")]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()

The second field, Count, is defined as DecimalType in the schema, while the values we put in the data list are plain Python integers.

Thus the following error is thrown:

TypeError: field Count: DecimalType(10,0) can not accept object 100 in type <class 'int'>

To fix it, we have at least two options.

Option 1 - change the definition of the schema

Since the data contains integers, we can change the field's type in the schema to IntegerType:

schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
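With this schema, the rest of the example runs without errors. Re-running the last few lines (assuming the same rdd and spark session from the example above) should print something roughly like this:

df = spark.createDataFrame(rdd, schema)
df.show()
# Expected output (approximately):
# +----------+-----+------------------+
# |  Category|Count|       Description|
# +----------+-----+------------------+
# |Category A|  100|This is category A|
# |Category B|  120|This is category B|
# |Category C|  150|This is category C|
# +----------+-----+------------------+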

Option 2 - change the data type of the input data

Alternatively, keep the DecimalType schema and convert the values to Decimal objects when building the data list:

from decimal import Decimal

data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]
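
The schema itself stays unchanged. Re-creating the RDD from the Decimal values and applying the original DecimalType schema now succeeds (a quick check, assuming the spark session and schema defined earlier):

# Rebuild the RDD from the Decimal-typed data and apply the original schema
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
df.printSchema()
# root
#  |-- Category: string (nullable = true)
#  |-- Count: decimal(10,0) (nullable = true)
#  |-- Description: string (nullable = true)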

Summary

The exact error you get may differ, but the approach to fixing it is usually one of the two options above: change the schema to match the data, or convert the data to match the schema. Leave a comment if you have a question or an issue you cannot fix.
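
When it is not obvious which field or value the error refers to, a small check like the following (not part of the original example, just a debugging sketch) compares each field's declared type with the Python type of a sample record before calling createDataFrame:

# Compare declared schema types with the Python types of the first record
sample = data[0]
for field, value in zip(schema.fields, sample):
    print(field.name, field.dataType, type(value).__name__, repr(value))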
