Scala: Change Data Frame Column Names in Spark


Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Scala. 

infoThis is the Scala version of article: Change DataFrame Column Names in PySpark

Construct a DataFrame

The following code snippet creates a DataFrame from an array of Scala lists. Spark SQL types are used to create the schema, and then the SparkSession.createDataFrame function converts the data to a Spark DataFrame object. 

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))

// Create a schema for the dataframe
val schema =
  StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)

// Convert each list into a Row and collect them into a List[Row]
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

The data frame looks like the following:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
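If a SparkSession named spark is already in scope (as it is in spark-shell), the same DataFrame can also be built more concisely from a Seq of tuples. The snippet below is a minimal sketch of that alternative:

// Alternative: build the same DataFrame from a Seq of tuples.
// Assumes a SparkSession named spark is available (e.g. in spark-shell).
import spark.implicits._

val dfAlt = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
).toDF("Category", "Count", "Description")
dfAlt.show()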

Print out column names

DataFrame.columns can be used to print out the column list of the data frame:

print(df.columns.toList)

Output:

List(Category, Count, Description)

Rename one column

We can use the withColumnRenamed function to rename a column.

val df2 = df.withColumnRenamed("Category", "category_new")
df2.show()

Output:

scala> df2.show()
+------------+-----+------------------+
|category_new|Count|       Description|
+------------+-----+------------------+
|  Category A|  100|This is category A|
|  Category B|  120|This is category B|
|  Category C|  150|This is category C|
+------------+-----+------------------+

Column Category is renamed to category_new.
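Because withColumnRenamed returns a new DataFrame, calls can be chained to rename several columns in one go. The snippet below is a small sketch of that pattern; note that renaming a column that does not exist is a no-op (no error is thrown).

// Chain withColumnRenamed to rename more than one column.
// Renaming a non-existent column is a no-op rather than an error.
val df2b = df
  .withColumnRenamed("Category", "category_new")
  .withColumnRenamed("Count", "count_new")
df2b.show()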

Rename all columns

The toDF function can be used to rename all columns at once. The following code snippet converts all column names to lower case and then appends '_new' to each column name.

// Rename columns
val new_column_names = df.columns.map(c => c.toLowerCase() + "_new")
val df3 = df.toDF(new_column_names: _*)
df3.show()

Output:

scala> df3.show()
+------------+---------+------------------+
|category_new|count_new|   description_new|
+------------+---------+------------------+
|  Category A|      100|This is category A|
|  Category B|      120|This is category B|
|  Category C|      150|This is category C|
+------------+---------+------------------+

You can use a similar approach to remove spaces or special characters from column names, as shown in the sketch below.
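For example, assuming the goal is to keep only letters, digits and underscores, something like the following could clean up the names before calling toDF (the regex and the lower-casing are illustrative choices, not part of the original example):

// Sketch: replace spaces and other special characters in column names.
// The regex "[^a-zA-Z0-9_]" is an illustrative choice; adjust it as needed.
val cleaned_names = df.columns.map(c => c.toLowerCase().replaceAll("[^a-zA-Z0-9_]", "_"))
val df_cleaned = df.toDF(cleaned_names: _*)
df_cleaned.show()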

infoIn Scala, _* is used to pass a list or array as a varargs parameter. In this example, the parameter type of toDF is String*. 
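As a plain Scala illustration (unrelated to Spark), the following hypothetical function shows how : _* expands a collection into a varargs parameter:

// Hypothetical example: greet takes a String* varargs parameter.
def greet(names: String*): Unit = names.foreach(n => println(s"Hello, $n"))

val people = List("Alice", "Bob")
greet(people: _*)   // the list is unpacked into individual arguments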

Use Spark SQL

Of course, you can also use Spark SQL to rename columns like the following code snippet shows:

df.createOrReplaceTempView("df")
spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()

The above code snippet first registers the dataframe as a temp view, and then Spark SQL is used to change the column names.

Output:

scala> spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()
+------------+---------+------------------+
|category_new|count_new|   description_new|
+------------+---------+------------------+
|  Category A|      100|This is category A|
|  Category B|      120|This is category B|
|  Category C|      150|This is category C|
+------------+---------+------------------+
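The same renaming can also be expressed without a temp view by using DataFrame.selectExpr, which accepts SQL expressions directly. A minimal sketch:

// selectExpr accepts SQL expressions, so the temp view is not required.
df.selectExpr(
  "Category as category_new",
  "Count as count_new",
  "Description as description_new"
).show()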

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow the Spark installation articles on this site to set up your Spark environment if you don't have one yet.
