Change Column Type in PySpark DataFrame


This article shows how to change column types of a Spark DataFrame using Python, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType.

Construct a DataFrame

Follow the article Convert Python Dictionary List to PySpark DataFrame to construct the sample DataFrame, or use the sketch below.
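
The following is a minimal construction sketch, not taken from the referenced article: the SparkSession setup and the explicit schema are assumptions added here so that the column types match the schema printed later in this article.

from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DecimalType

# Create or reuse a local SparkSession (adjust the app name as needed).
spark = SparkSession.builder.appName('change-column-type').getOrCreate()

# Explicit schema so that ID is IntegerType and Value is DecimalType(10,2).
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(10, 2), True)
])

data = [
    ('Category A', 1, Decimal('12.40')),
    ('Category B', 2, Decimal('30.10')),
    ('Category C', 3, Decimal('100.01'))
]

df = spark.createDataFrame(data, schema=schema)
df.show()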

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+

Let's add two constant columns via the lit function:

from pyspark.sql.functions import lit

df1 = df.withColumn('Str_Col1', lit('1')) \
    .withColumn('Str_Col2', lit('2020-08-09'))
df1.show()
print(df1.schema)

Output:

+----------+---+------+--------+----------+
|  Category| ID| Value|Str_Col1|  Str_Col2|
+----------+---+------+--------+----------+
|Category A|  1| 12.40|       1|2020-08-09|
|Category B|  2| 30.10|       1|2020-08-09|
|Category C|  3|100.01|       1|2020-08-09|
+----------+---+------+--------+----------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1,StringType,false),StructField(Str_Col2,StringType,false)))

As printed out, the current data types are StringType, IntegerType, DecimalType(10,2), StringType, and StringType.
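
As a side note (not part of the original snippet), DataFrame.printSchema prints the same information as a more readable tree:

# Tree-formatted view of the schema above.
df1.printSchema()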

Change column types using cast function

The Column.cast function can be used to convert data types.

The following code snippet shows some of the commonly used conversions:

from pyspark.sql.types import DateType
df1 = df1.withColumn("Str_Col1_Int", df1['Str_Col1'].cast('int')).drop('Str_Col1') \
    .withColumn('Str_Col2_Date', df1['Str_Col2'].cast(DateType())).drop('Str_Col2')
df1.show()
print(df1.schema)

Output:

+----------+---+------+------------+-------------+
|  Category| ID| Value|Str_Col1_Int|Str_Col2_Date|
+----------+---+------+------------+-------------+
|Category A|  1| 12.40|           1|   2020-08-09|
|Category B|  2| 30.10|           1|   2020-08-09|
|Category C|  3|100.01|           1|   2020-08-09|
+----------+---+------+------------+-------------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1_Int,IntegerType,true),StructField(Str_Col2_Date,DateType,true)))

As printed out, the two new columns are of IntegerType and DateType.

Tip: the cast function is used in two different ways above: one uses the implicit type string 'int' while the other uses the explicit type DateType. For the latter, you need to ensure the class is imported.
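
For reference, the 'int' type string and an explicit type object such as IntegerType() are interchangeable, and the same casts can also be written as SQL expressions. The sketch below assumes a DataFrame named df_str (a hypothetical name used here) that still contains the string columns Str_Col1 and Str_Col2, i.e. the df1 produced right after the lit step above:

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# df_str is assumed to still hold the original string columns.
# Explicit type object instead of the 'int' type string.
df_a = df_str.withColumn('Str_Col1_Int', col('Str_Col1').cast(IntegerType()))

# Equivalent SQL-style CAST expressions via selectExpr.
df_b = df_str.selectExpr('Category', 'ID', 'Value',
                         'CAST(Str_Col1 AS INT) AS Str_Col1_Int',
                         'CAST(Str_Col2 AS DATE) AS Str_Col2_Date')

Both forms should yield the same IntegerType and DateType results as the cast calls shown earlier.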

Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow the Spark installation articles on this site to set up your Spark environment if you don't have one yet.
