Change Column Type in PySpark DataFrame


This article shows how to change the column types of a Spark DataFrame using Python, for example converting StringType to DoubleType, StringType to IntegerType, and StringType to DateType.

Construct a DataFrame

Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame with the sample data below (a standalone sketch also follows the sample data).

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+
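
If you prefer not to follow that article, the snippet below is a minimal, self-contained sketch that builds an equivalent DataFrame with the same schema (the app name ChangeColumnType is arbitrary):

from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

spark = SparkSession.builder.appName('ChangeColumnType').getOrCreate()

# Explicit schema so that Value is DecimalType(10,2), matching the schema printed later
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(10, 2), True)
])

data = [('Category A', 1, Decimal('12.40')),
        ('Category B', 2, Decimal('30.10')),
        ('Category C', 3, Decimal('100.01'))]

df = spark.createDataFrame(data, schema=schema)
df.show()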

Let's add two constant columns via the lit function:

from pyspark.sql.functions import lit

# Add two constant string columns
df1 = df.withColumn('Str_Col1', lit('1')).withColumn(
    'Str_Col2', lit('2020-08-09'))
df1.show()
print(df1.schema)

Output:

+----------+---+------+--------+----------+
|  Category| ID| Value|Str_Col1|  Str_Col2|
+----------+---+------+--------+----------+
|Category A|  1| 12.40|       1|2020-08-09|
|Category B|  2| 30.10|       1|2020-08-09|
|Category C|  3|100.01|       1|2020-08-09|
+----------+---+------+--------+----------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1,StringType,false),StructField(Str_Col2,StringType,false)))

As printed out, the current data types are StringType, IntegerType, DecimalType, StringType and StringType.
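
The printed StructType is a little hard to read; df1.printSchema() shows the same information as a tree:

df1.printSchema()

Output:

root
 |-- Category: string (nullable = false)
 |-- ID: integer (nullable = false)
 |-- Value: decimal(10,2) (nullable = true)
 |-- Str_Col1: string (nullable = false)
 |-- Str_Col2: string (nullable = false)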

Change column types using the cast function

Function Column.cast can be used to convert data types.

The following code snippet shows some of the commonly used conversions:

from pyspark.sql.types import DateType

# Cast the string columns: one via a type string ('int'), the other via an explicit DateType
df1 = df1.withColumn("Str_Col1_Int", df1['Str_Col1'].cast('int')).drop('Str_Col1') \
    .withColumn('Str_Col2_Date', df1['Str_Col2'].cast(DateType())).drop('Str_Col2')
df1.show()
print(df1.schema)

Output:

+----------+---+------+------------+-------------+
|  Category| ID| Value|Str_Col1_Int|Str_Col2_Date|
+----------+---+------+------------+-------------+
|Category A|  1| 12.40|           1|   2020-08-09|
|Category B|  2| 30.10|           1|   2020-08-09|
|Category C|  3|100.01|           1|   2020-08-09|
+----------+---+------+------------+-------------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1_Int,IntegerType,true),StructField(Str_Col2_Date,DateType,true)))

As printed out, the two new columns are IntegerType and DateType.

Tip: the cast function is used in two different ways here: the first call uses the implicit type string 'int', while the second uses an explicit DateType instance. For the latter, make sure the class is imported.
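
The same cast pattern covers other conversions, for example StringType to DoubleType mentioned at the beginning. The snippet below is a sketch that reuses the original df and introduces a hypothetical string column named Dbl_Str; it also shows the equivalent SQL-style cast via selectExpr:

from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType

# StringType to DoubleType using an explicit type instance
df2 = df.withColumn('Dbl_Str', lit('3.14'))
df2 = df2.withColumn('Dbl_Col', df2['Dbl_Str'].cast(DoubleType())).drop('Dbl_Str')
df2.printSchema()

# The same kind of conversion expressed with SQL cast syntax
df3 = df.selectExpr('Category', 'ID', 'cast(Value as double) as Value_Double')
df3.printSchema()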

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow the setup articles on this site to set up your Spark environment if you don't have one yet.
