Change DataFrame Column Names in PySpark


Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Python. 

Construct a DataFrame

The following code snippet creates a DataFrame from a list of Python dictionaries. PySpark SQL types are used to define the schema, and the SparkSession.createDataFrame function then converts the list to a Spark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Rename DataFrame Column Names"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List of rows. Decimal values are constructed from strings so they are
# stored exactly; Decimal(12.40) would carry the binary floating-point
# representation error of the float literal.
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal('12.40')},
        {"Category": 'Category B', "ID": 2, "Value": Decimal('30.10')},
        {"Category": 'Category C', "ID": 3, "Value": Decimal('100.01')}
        ]
# Schema        
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ID', IntegerType(), False),
    StructField('Value', DecimalType(precision=10, scale=2), True)  # 10 digits in total, 2 after the decimal point
])


# Create data frame
df = spark.createDataFrame(data, schema)
df.show()

The content looks like the following:

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+

Print out column names

DataFrame.columns can be used to print out the column list of the data frame:

print(df.columns)

Output:

['Category', 'ID', 'Value']
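
If you also want the data types along with the names, printSchema prints the full schema tree:

df.printSchema()

Output:

root
 |-- Category: string (nullable = false)
 |-- ID: integer (nullable = false)
 |-- Value: decimal(10,2) (nullable = true)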

Rename one column

We can use the withColumnRenamed function to rename a single column. Like all DataFrame transformations, it returns a new DataFrame rather than modifying the original:

df = df.withColumnRenamed('Category', 'category_new')
df.show()

Output:

+------------+---+------+
|category_new| ID| Value|
+------------+---+------+
|  Category A|  1| 12.40|
|  Category B|  2| 30.10|
|  Category C|  3|100.01|
+------------+---+------+

Column Category is renamed to category_new. Note that withColumnRenamed silently returns the DataFrame unchanged if the given column does not exist, so watch out for typos in the source column name.
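
To rename several columns at once, you can chain withColumnRenamed calls. A minimal sketch, assuming a hypothetical renames mapping from old to new names:

# Hypothetical mapping of current column names to new names
renames = {'ID': 'id_new', 'Value': 'value_new'}

renamed = df
for old_name, new_name in renames.items():
    renamed = renamed.withColumnRenamed(old_name, new_name)
renamed.show()

Each call creates a new DataFrame, so chaining a handful of renames is cheap; to rename every column at once, the toDF function below is more concise.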

Rename all columns

The toDF function can be used to rename all columns at once; it expects exactly one new name for each existing column, in order. The following code snippet converts each column name to lower case and appends '_new' to it.

# Recreate the DataFrame so we start from the original column names
df = spark.createDataFrame(data, schema)

# Lower-case each column name and append '_new'
new_column_names = [f"{c.lower()}_new" for c in df.columns]
df = df.toDF(*new_column_names)
df.show()

Output:

+------------+------+---------+
|category_new|id_new|value_new|
+------------+------+---------+
|  Category A|     1|    12.40|
|  Category B|     2|    30.10|
|  Category C|     3|   100.01|
+------------+------+---------+

You can use a similar approach to remove spaces or special characters from column names, as the sketch below shows.
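
For example, here is a minimal sketch, assuming a hypothetical DataFrame named messy whose column names contain spaces and punctuation; it uses Python's re module to replace every run of non-word characters with an underscore:

import re

# Hypothetical DataFrame with messy column names
messy = spark.createDataFrame([(1, 'a')], ['Product ID', 'Category (Main)'])

# Replace runs of non-word characters with '_', trim stray underscores, lower-case
clean_names = [re.sub(r'\W+', '_', c).strip('_').lower() for c in messy.columns]
messy = messy.toDF(*clean_names)
print(messy.columns)

Output:

['product_id', 'category_main']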

Use Spark SQL

Of course, you can also use Spark SQL to rename columns, as the following code snippet shows:

# Recreate the DataFrame with the original column names
df = spark.createDataFrame(data, schema)

df.createOrReplaceTempView("df")
spark.sql("select Category as category_new, ID as id_new, Value as value_new from df").show()

The above code snippet first registers the DataFrame as a temporary view, and then a Spark SQL SELECT statement with column aliases produces the new names.

Output:

+------------+------+---------+
|category_new|id_new|value_new|
+------------+------+---------+
|  Category A|     1|    12.40|
|  Category B|     2|    30.10|
|  Category C|     3|   100.01|
+------------+------+---------+
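
If you prefer to stay in the DataFrame API, the selectExpr function accepts the same aliasing syntax without registering a temporary view:

# Equivalent to the SQL statement above, without a temp view
df.selectExpr('Category as category_new',
              'ID as id_new',
              'Value as value_new').show()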

Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). Follow the installation articles on this site to set up your Spark environment if you don't have one yet.
