Scala: Change Data Frame Column Names in Spark
Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Scala.
Construct a dataframe
The following code snippet creates a DataFrame from an array of Scala list. Spark SQL types are used to create the schema and then SparkSession.createDataFrame function is used to convert the array of list to a Spark DataFrame object.
import org.apache.spark.sql._ import org.apache.spark.sql.types._ val data = Array(List("Category A", 100, "This is category A"), List("Category B", 120, "This is category B"), List("Category C", 150, "This is category C")) // Create a schema for the dataframe val schema = StructType( StructField("Category", StringType, true) :: StructField("Count", IntegerType, true) :: StructField("Description", StringType, true) :: Nil) // Convert list to List of Row val rows = data.map(t=>Row(t(0),t(1),t(2))).toList // Create RDD val rdd = spark.sparkContext.parallelize(rows) // Create data frame val df = spark.createDataFrame(rdd,schema) print(df.schema) df.show()
The data frame looks like the following:
+----------+-----+------------------+ | Category|Count| Description| +----------+-----+------------------+ |Category A| 100|This is category A| |Category B| 120|This is category B| |Category C| 150|This is category C| +----------+-----+------------------+
Print out column names
DataFrame.columns can be used to print out column list of the data frame:
print(df.columns.toList)
Output:
List(Category, Count, Description)
Rename one column
We can use withColumnRenamed function to change column names.
val df2 = df.withColumnRenamed("Category", "category_new") df2.show()
Output:
scala> df2.show() +------------+-----+------------------+ |category_new|Count| Description| +------------+-----+------------------+ | Category A| 100|This is category A| | Category B| 120|This is category B| | Category C| 150|This is category C| +------------+-----+------------------+
Column Category is renamed to category_new.
Rename all columns
Function toDF can be used to rename all column names. The following code snippet converts all column names to lower case and then append '_new' to each column name.
# Rename columns val new_column_names=df.columns.map(c=>c.toLowerCase() + "_new") val df3 = df.toDF(new_column_names:_*) df3.show()
Output:
scala> df3.show() +------------+---------+------------------+ |category_new|count_new| description_new| +------------+---------+------------------+ | Category A| 100|This is category A| | Category B| 120|This is category B| | Category C| 150|This is category C| +------------+---------+------------------+
You can use similar approach to remove spaces or special characters from column names.
Use Spark SQL
Of course, you can also use Spark SQL to rename columns like the following code snippet shows:
df.createOrReplaceTempView("df") spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()
The above code snippet first register the dataframe as a temp view. And then Spark SQL is used to change column names.
Output:
scala> spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show() +------------+---------+------------------+ |category_new|count_new| description_new| +------------+---------+------------------+ | Category A| 100|This is category A| | Category B| 120|This is category B| | Category C| 150|This is category C| +------------+---------+------------------+
Run Spark code
You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet: