Change Column Type in PySpark DataFrame
This article shows how to change the column types of a Spark DataFrame using Python, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType.
Construct a dataframe
Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame with the following content:
+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+
Let's add two constant string columns via the lit function:
from pyspark.sql.functions import lit

df1 = df.withColumn('Str_Col1', lit('1')).withColumn(
    'Str_Col2', lit('2020-08-09'))
df1.show()
print(df1.schema)
Output:
+----------+---+------+--------+----------+
|  Category| ID| Value|Str_Col1|  Str_Col2|
+----------+---+------+--------+----------+
|Category A|  1| 12.40|       1|2020-08-09|
|Category B|  2| 30.10|       1|2020-08-09|
|Category C|  3|100.01|       1|2020-08-09|
+----------+---+------+--------+----------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1,StringType,false),StructField(Str_Col2,StringType,false)))
As the printed schema shows, the current column types are StringType, IntegerType, DecimalType, StringType and StringType.
Change column types using cast function
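The Column.cast function can be used to convert data types. The target type can be passed either as a DataType instance such as DateType() or as a type name string such as 'int'.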
The following code snippet shows some of the commonly used conversions:
from pyspark.sql.types import DateType

df1 = df1.withColumn("Str_Col1_Int", df1['Str_Col1'].cast('int')).drop('Str_Col1') \
    .withColumn('Str_Col2_Date', df1['Str_Col2'].cast(DateType())).drop('Str_Col2')
df1.show()
print(df1.schema)
Output:
+----------+---+------+------------+-------------+
|  Category| ID| Value|Str_Col1_Int|Str_Col2_Date|
+----------+---+------+------------+-------------+
|Category A|  1| 12.40|           1|   2020-08-09|
|Category B|  2| 30.10|           1|   2020-08-09|
|Category C|  3|100.01|           1|   2020-08-09|
+----------+---+------+------------+-------------+

StructType(List(StructField(Category,StringType,false),StructField(ID,IntegerType,false),StructField(Value,DecimalType(10,2),true),StructField(Str_Col1_Int,IntegerType,true),StructField(Str_Col2_Date,DateType,true)))
As the printed schema shows, the two new columns are IntegerType and DateType.
Run Spark code
You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow these articles to set up your Spark environment if you don't have one yet: