Scala: Change Column Type in Spark Data Frame
This article shows how to change the column types of a Spark DataFrame using Scala, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType.
Construct a dataframe
Follow the article Scala: Convert List to Spark Data Frame to construct a DataFrame named df with the following content:
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
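The linked article is not reproduced here, so the following is only a minimal sketch of one way to build an equivalent DataFrame. It assumes a local SparkSession and uses Option[Int] for Count so that the column is nullable, as in the schema printed later; the linked article may construct the DataFrame differently.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("change-column-type").master("local[*]").getOrCreate()
import spark.implicits._

// Assumption: rows are modelled as tuples; Option[Int] makes Count a nullable IntegerType.
val df = Seq[(String, Option[Int], String)](
  ("Category A", Some(100), "This is category A"),
  ("Category B", Some(120), "This is category B"),
  ("Category C", Some(150), "This is category C")
).toDF("Category", "Count", "Description")

df.show()
```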
Let's add two constant string columns via the lit function:
val df1 = df.withColumn("Str_Col1", lit("1")).withColumn(
    "Str_Col2", lit("2020-08-09"))
df1.show()
print(df1.schema)
Output:
scala> df1.show()
+----------+-----+------------------+--------+----------+
|  Category|Count|       Description|Str_Col1|  Str_Col2|
+----------+-----+------------------+--------+----------+
|Category A|  100|This is category A|       1|2020-08-09|
|Category B|  120|This is category B|       1|2020-08-09|
|Category C| null|This is category C|       1|2020-08-09|
+----------+-----+------------------+--------+----------+

scala> print(df1.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1,StringType,false), StructField(Str_Col2,StringType,false))
As printed out, current data types are StringType, IntegerType, StringType, StringType and StringType respectively.
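The single-line StructType output can be hard to scan. As a sketch, printSchema() and dtypes give more readable views of the same information. The small DataFrame named sample below is a hypothetical stand-in for df1, built only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("inspect-schema").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for df1: one regular column plus the two constant string columns.
val sample = Seq("Category A", "Category B").toDF("Category").
  withColumn("Str_Col1", lit("1")).
  withColumn("Str_Col2", lit("2020-08-09"))

// Tree view: one line per column with its type and nullability.
sample.printSchema()

// Pairs of column name and type name.
sample.dtypes.foreach { case (name, typeName) => println(s"$name: $typeName") }
```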
Change column types using cast function
The Column.cast function can be used to convert data types. It accepts either a type name string such as "int" or a DataType object such as DateType.
The following code snippet shows some of the commonly used conversions:
val df2 = df1.withColumn("Str_Col1_Int", $"Str_Col1".cast("int")).drop("Str_Col1").
    withColumn("Str_Col2_Date", $"Str_Col2".cast(DateType)).drop("Str_Col2")
df2.show()
print(df2.schema)
Output:
scala> df2.show()
+----------+-----+------------------+------------+-------------+
|  Category|Count|       Description|Str_Col1_Int|Str_Col2_Date|
+----------+-----+------------------+------------+-------------+
|Category A|  100|This is category A|           1|   2020-08-09|
|Category B|  120|This is category B|           1|   2020-08-09|
|Category C| null|This is category C|           1|   2020-08-09|
+----------+-----+------------------+------------+-------------+

scala> print(df2.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1_Int,IntegerType,true), StructField(Str_Col2_Date,DateType,true))
As printed out, the two new columns are IntegerType and DateType.
Note that referencing DateType in the cast snippet requires the following import:
import org.apache.spark.sql.types._
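The introduction mentions converting StringType to DoubleType, which the cast snippet above does not cover. A minimal sketch follows, assuming a hypothetical stand-in DataFrame named base. It also passes the original column name to withColumn, which replaces the column in place so that no drop call is needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("cast-to-double").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in DataFrame with a constant string column, mirroring Str_Col1 above.
val base = Seq("Category A", "Category B").toDF("Category").withColumn("Str_Col1", lit("1"))

// Reusing the name "Str_Col1" replaces the existing column instead of adding a new one.
val df3 = base.withColumn("Str_Col1", col("Str_Col1").cast(DoubleType))

df3.show()
df3.printSchema()
```

Passing the type name as a string, cast("double"), should be equivalent to cast(DoubleType) and avoids the extra import.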
Run Spark code
You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow these articles to set up your Spark environment if you don't have one yet: