Scala: Change Column Type in Spark Data Frame

Raymond, 2020-12-14

This article shows how to change the column types of a Spark DataFrame using Scala, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType.

Construct a dataframe 

Follow the article "Scala: Convert List to Spark Data Frame" to construct a dataframe.

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
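For readers who do not have the referenced article at hand, the following is a minimal sketch of one way to build a dataframe with these columns. It is an assumption rather than the exact code from that article: the application name, the local master setting and the variable names data and schema are chosen for illustration, and the values are copied from the table above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// In spark-shell a session already exists and getOrCreate() reuses it.
// The app name and local master are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("change-column-type")
  .master("local[*]")
  .getOrCreate()

// Rows mirroring the table above.
val data = Seq(
  Row("Category A", 100, "This is category A"),
  Row("Category B", 120, "This is category B"),
  Row("Category C", 150, "This is category C")
)

// Explicit schema: one string, one integer and one string column, all nullable.
val schema = StructType(Seq(
  StructField("Category", StringType, nullable = true),
  StructField("Count", IntegerType, nullable = true),
  StructField("Description", StringType, nullable = true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()
```

Passing an explicit StructType keeps every column nullable, which matches the schema printed later in this article.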

Let's add two constant string columns using the lit function:

import org.apache.spark.sql.functions.lit

val df1 = df.withColumn("Str_Col1", lit("1"))
  .withColumn("Str_Col2", lit("2020-08-09"))
df1.show()
print(df1.schema)

Output:

scala> df1.show()
+----------+-----+------------------+--------+----------+
|  Category|Count|       Description|Str_Col1|  Str_Col2|
+----------+-----+------------------+--------+----------+
|Category A|  100|This is category A|       1|2020-08-09|
|Category B|  120|This is category B|       1|2020-08-09|
|Category C| null|This is category C|       1|2020-08-09|
+----------+-----+------------------+--------+----------+

scala> print(df1.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1,StringType,false), StructField(Str_Col2,StringType,false))

As the printed schema shows, the current data types of the five columns are StringType, IntegerType, StringType, StringType and StringType respectively.
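The single-line StructType output can be hard to scan. As a sketch of two more readable alternatives, printSchema prints the schema as a tree and dtypes returns (column name, type name) pairs. The dataframe named sample below is a hypothetical one-row stand-in for df1, built only so the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("inspect-schema")
  .master("local[*]")
  .getOrCreate()

// Hypothetical stand-in for df1 with the same column names.
val sample = spark
  .createDataFrame(Seq(("Category A", 100, "This is category A")))
  .toDF("Category", "Count", "Description")
  .withColumn("Str_Col1", lit("1"))
  .withColumn("Str_Col2", lit("2020-08-09"))

// Tree view of the schema, one column per line.
sample.printSchema()

// (column name, type name) pairs.
sample.dtypes.foreach { case (name, typeName) => println(s"$name: $typeName") }
```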

Change column types using cast function

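The Column.cast function can be used to convert data types. It is called on a column and returns a new column of the target type, which is then attached to the dataframe with withColumn.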

The following code snippet shows some of the commonly used conversions:

// $"..." requires import spark.implicits._ (imported automatically in spark-shell)
val df2 = df1
  .withColumn("Str_Col1_Int", $"Str_Col1".cast("int")).drop("Str_Col1")
  .withColumn("Str_Col2_Date", $"Str_Col2".cast(DateType)).drop("Str_Col2")
df2.show()
print(df2.schema)

Output:

scala> df2.show()
+----------+-----+------------------+------------+-------------+
|  Category|Count|       Description|Str_Col1_Int|Str_Col2_Date|
+----------+-----+------------------+------------+-------------+
|Category A|  100|This is category A|           1|   2020-08-09|
|Category B|  120|This is category B|           1|   2020-08-09|
|Category C| null|This is category C|           1|   2020-08-09|
+----------+-----+------------------+------------+-------------+


scala> print(df2.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1_Int,IntegerType,true), StructField(Str_Col2_Date,DateType,true))

As the printed schema shows, the two new columns are IntegerType and DateType.
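The StringType to DoubleType conversion mentioned in the introduction follows the same pattern. The sketch below uses hypothetical sample values and column names (Str_Value, Double_Value), and it explicitly turns ANSI mode off so that a string that cannot be parsed becomes null; with ANSI mode enabled, Spark raises an error for such a value instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder()
  .appName("cast-to-double")
  .master("local[*]")
  .getOrCreate()

// Assumption: ANSI mode is off, so a failed cast yields null rather than an error.
spark.conf.set("spark.sql.ansi.enabled", "false")

// Hypothetical sample data; "abc" cannot be parsed as a number.
val raw = spark
  .createDataFrame(Seq(Tuple1("1.5"), Tuple1("42"), Tuple1("abc")))
  .toDF("Str_Value")

// Cast the string column to DoubleType and keep the original for comparison.
val converted = raw.withColumn("Double_Value", col("Str_Value").cast(DoubleType))
converted.show()
converted.printSchema()
```

Checking for unexpected nulls after a cast is a simple way to spot values that did not convert.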

Tip: the two cast calls specify the target type differently: one uses the type name as a string ('int'), while the other uses the DateType object. For the latter, you need to ensure the class is imported. In this session, the types were imported by the following line in the code snippet that constructs the data frame:
import org.apache.spark.sql.types._
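To illustrate the tip, here is a sketch that expresses the same two conversions in three ways: with type names as strings, with DataType objects, and with a SQL CAST expression through selectExpr. The one-row dataframe src and the result names byName, byType and bySql are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

val spark = SparkSession.builder()
  .appName("cast-variants")
  .master("local[*]")
  .getOrCreate()

// Hypothetical one-row input with two string columns.
val src = spark
  .createDataFrame(Seq(("1", "2020-08-09")))
  .toDF("Str_Col1", "Str_Col2")

// 1. Target type given as a string name; no types import needed.
val byName = src.select(
  col("Str_Col1").cast("int").as("Str_Col1"),
  col("Str_Col2").cast("date").as("Str_Col2"))

// 2. Target type given as a DataType object; requires the types import.
val byType = src.select(
  col("Str_Col1").cast(IntegerType).as("Str_Col1"),
  col("Str_Col2").cast(DateType).as("Str_Col2"))

// 3. SQL CAST expressions.
val bySql = src.selectExpr(
  "CAST(Str_Col1 AS INT) AS Str_Col1",
  "CAST(Str_Col2 AS DATE) AS Str_Col2")

byName.printSchema()
byType.printSchema()
bySql.printSchema()
```

All three produce the same column types, so the choice is mostly a matter of style; the DataType form lets the compiler catch a misspelled type name, while the string form fails only at run time.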

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. If you don't have a Spark environment yet, follow the Spark installation articles on Kontext to set one up.
