Scala: Change Column Type in Spark Data Frame


This article shows how to change the column types of a Spark DataFrame using Scala, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType.

Construct a DataFrame

Follow the article Scala: Convert List to Spark Data Frame to construct a DataFrame. It looks like the following (a quick construction sketch is included after the table for reference):

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
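If you just want something to follow along with, here is a minimal construction sketch, assuming a spark-shell session where the spark object is already available (the referenced article may build the DataFrame slightly differently, e.g. with nullable columns):

// In spark-shell, the `spark` session is available; these imports are used throughout.
import org.apache.spark.sql.types._
import spark.implicits._

// Build the sample DataFrame from a local collection.
val df = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
).toDF("Category", "Count", "Description")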

Let's add two constant string columns via the lit function:

import org.apache.spark.sql.functions.lit

// Add two constant string columns.
val df1 = df
  .withColumn("Str_Col1", lit("1"))
  .withColumn("Str_Col2", lit("2020-08-09"))
df1.show()
print(df1.schema)

Output:

scala> df1.show()
+----------+-----+------------------+--------+----------+
|  Category|Count|       Description|Str_Col1|  Str_Col2|
+----------+-----+------------------+--------+----------+
|Category A|  100|This is category A|       1|2020-08-09|
|Category B|  120|This is category B|       1|2020-08-09|
|Category C|  150|This is category C|       1|2020-08-09|
+----------+-----+------------------+--------+----------+

scala> print(df1.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1,StringType,false), StructField(Str_Col2,StringType,false))

As printed out, the data types of the five columns are StringType, IntegerType, StringType, StringType and StringType respectively.
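As a side note (not part of the original walkthrough), DataFrame.printSchema gives a more readable view of the same information:

df1.printSchema()
// root
//  |-- Category: string (nullable = true)
//  |-- Count: integer (nullable = true)
//  |-- Description: string (nullable = true)
//  |-- Str_Col1: string (nullable = false)
//  |-- Str_Col2: string (nullable = false)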

Change column types using the cast function

The cast function of the Column class can be used to convert a column's data type.

The following code snippet shows some of the commonly used conversions:

val df2 = df1
  .withColumn("Str_Col1_Int", $"Str_Col1".cast("int"))
  .drop("Str_Col1")
  .withColumn("Str_Col2_Date", $"Str_Col2".cast(DateType))
  .drop("Str_Col2")
df2.show()
print(df2.schema)

Output:

scala> df2.show()
+----------+-----+------------------+------------+-------------+
|  Category|Count|       Description|Str_Col1_Int|Str_Col2_Date|
+----------+-----+------------------+------------+-------------+
|Category A|  100|This is category A|           1|   2020-08-09|
|Category B|  120|This is category B|           1|   2020-08-09|
|Category C|  150|This is category C|           1|   2020-08-09|
+----------+-----+------------------+------------+-------------+


scala> print(df2.schema)
StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true), StructField(Str_Col1_Int,IntegerType,true), StructField(Str_Col2_Date,DateType,true))

As printed out, the two new columns are IntegerType and DateType respectively. 
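If many columns need converting, the same cast pattern can be applied in a loop. The snippet below is only a sketch (not from the original article); unlike the example above, it overwrites each column in place rather than creating renamed copies:

import org.apache.spark.sql.functions.col

// Cast several columns in one pass; keys are column names, values are type name strings.
val targetTypes = Map("Str_Col1" -> "int", "Str_Col2" -> "date")
val df2b = targetTypes.foldLeft(df1) { case (acc, (name, dataType)) =>
  acc.withColumn(name, col(name).cast(dataType))
}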

Tip: the two cast calls are used differently: the first passes the target type as a string ('int') while the second passes an explicit type object (DateType). For the latter, make sure the corresponding class is imported. In this session, the types were imported by the following line when constructing the DataFrame:
import org.apache.spark.sql.types._
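The introduction also mentioned StringType to DoubleType; that follows exactly the same pattern. For date strings that are not in the default yyyy-MM-dd format, the to_date function with an explicit pattern is an alternative to cast. A quick sketch (the column name Str_Col1_Dbl and the pattern are only illustrative):

import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.types.DoubleType

// StringType -> DoubleType with the same cast pattern.
val withDouble = df1.withColumn("Str_Col1_Dbl", $"Str_Col1".cast(DoubleType))

// StringType -> DateType via to_date; the pattern argument handles non-ISO formats.
val withDate = df1.withColumn("Str_Col2_Date", to_date($"Str_Col2", "yyyy-MM-dd"))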

Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). If you don't have a Spark environment yet, follow the Kontext articles on setting one up.
