Rename DataFrame Column Names in PySpark

2020-08-09


Construct a dataframe
Print out column names
Rename one column
Rename all columns
Use Spark SQL
Run Spark code

Column renaming is a common action when working with data frames. In this article, I will show you how to change column names in a Spark data frame using Python. The most frequently used method is withColumnRenamed.

Construct a dataframe

The following code snippet creates a DataFrame from a list of native Python dictionaries. PySpark SQL types are used to define the schema, and then the SparkSession.createDataFrame function converts the list to a Spark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Rename DataFrame Column Names"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

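# Sample data: a list of dictionaries. Decimal values are built from strings
# so that they hold exactly two decimal places (no float rounding noise).
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal('12.40')},
        {"Category": 'Category B', "ID": 2, "Value": Decimal('30.10')},
        {"Category": 'Category C', "ID": 3, "Value": Decimal('100.01')}
        ]
# Schema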
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(scale=2), True)
])


# Create data frame
df = spark.createDataFrame(data, schema)
df.show()

The content looks like the following:

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+

Print out column names

DataFrame.columns returns the column names of the data frame as a Python list:

print(df.columns)

Output:

['Category', 'ID', 'Value']
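
Because DataFrame.columns is an ordinary Python list, it can be inspected before a rename is attempted. This matters because withColumnRenamed does nothing when the given column does not exist, so a typo can go unnoticed. The sketch below uses a hypothetical helper, missing_columns (not part of PySpark), that reports which names in a rename mapping are absent from a column list. It works on the plain list printed above and needs no Spark session. It compares names exactly, which may be stricter than Spark's own name resolution, depending on configuration.

```python
def missing_columns(columns, rename_map):
    """Return the keys of rename_map that are not present in columns."""
    existing = set(columns)
    return [name for name in rename_map if name not in existing]


# Column list as printed by df.columns in this article
columns = ['Category', 'ID', 'Value']

# Hypothetical mapping; 'Amount' is deliberately not a real column
rename_map = {'Category': 'category_new', 'Amount': 'amount_new'}

print(missing_columns(columns, rename_map))  # ['Amount']
```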

Rename one column

The withColumnRenamed function returns a new DataFrame with one column renamed. If the DataFrame has no column with the given name, the call is a no-op. The result is assigned to a new variable here so that df keeps its original column names for the following sections.

df_renamed = df.withColumnRenamed('Category', 'category_new')
df_renamed.show()

Output:

+------------+---+------+
|category_new| ID| Value|
+------------+---+------+
|  Category A|  1| 12.40|
|  Category B|  2| 30.10|
|  Category C|  3|100.01|
+------------+---+------+

Column Category is renamed to category_new.

Rename all columns

The toDF function can be used to rename all columns at once. The following code snippet converts all column names to lower case and then appends '_new' to each column name. Again, the result is stored in a new variable so that df is left unchanged.

# Rename columns
new_column_names = [f"{c.lower()}_new" for c in df.columns]
df_all = df.toDF(*new_column_names)
df_all.show()

Output:

+------------+------+---------+
|category_new|id_new|value_new|
+------------+------+---------+
|  Category A|     1|    12.40|
|  Category B|     2|    30.10|
|  Category C|     3|   100.01|
+------------+------+---------+

You can use a similar approach to remove spaces or special characters from column names.
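
As a rough illustration of that idea, the sketch below builds cleaned names with a regular expression. The helper clean_column_name and the sample names are hypothetical, and the cleaning rule (replace each run of non-alphanumeric characters with an underscore, then lower-case) is only one possible convention.

```python
import re


def clean_column_name(name):
    """Lower-case a name and replace runs of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name.strip())
    return cleaned.strip('_').lower()


# Hypothetical messy column names, not taken from the DataFrame above
raw_names = ['Order ID', ' Unit Price ($)', 'Category-Name']
clean_names = [clean_column_name(n) for n in raw_names]
print(clean_names)  # ['order_id', 'unit_price', 'category_name']
```

The cleaned list can then be passed to toDF in the same way as new_column_names above. Two different original names may clean to the same result, so it is worth checking the list for duplicates first.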

Use Spark SQL

You can also use Spark SQL to rename columns, as the following code snippet shows:

df.createOrReplaceTempView("df")
spark.sql("select Category as category_new, ID as id_new, Value as value_new from df").show()

The above code snippet first registers the data frame as a temp view and then uses Spark SQL column aliases to change the column names.

Output:

+------------+------+---------+
|category_new|id_new|value_new|
+------------+------+---------+
|  Category A|     1|    12.40|
|  Category B|     2|    30.10|
|  Category C|     3|   100.01|
+------------+------+---------+
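
When there are many columns, writing the select list by hand becomes tedious. The sketch below generates an equivalent statement from a dictionary of old and new names. The helper build_rename_query is hypothetical, and it assumes that none of the names contain a backtick, which is the character used here to quote identifiers.

```python
def build_rename_query(view_name, rename_map):
    """Build a SELECT statement that aliases each old column name to its new name."""
    select_list = ", ".join(
        f"`{old}` as `{new}`" for old, new in rename_map.items()
    )
    return f"select {select_list} from `{view_name}`"


rename_map = {'Category': 'category_new', 'ID': 'id_new', 'Value': 'value_new'}
query = build_rename_query('df', rename_map)
print(query)
# select `Category` as `category_new`, `ID` as `id_new`, `Value` as `value_new` from `df`
```

With the temp view registered as shown earlier, the generated string can be passed to spark.sql in place of the hand-written query.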

Run Spark code

You can run Spark code on Windows or UNIX-like (Linux, macOS) systems. If you don't have a Spark environment yet, set one up for your platform before running the examples above.
