Scala - Add Constant Column to Spark Data Frame


This article shows how to add a constant or literal column to a Spark data frame using Scala.

Construct a data frame

Follow the article Scala: Convert List to Spark Data Frame to construct a sample data frame, or use the sketch below.
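If you just want a quick local equivalent, the following is a minimal sketch, assuming a spark-shell session (where the spark session is already available); the column names and values simply mirror the table below:

import spark.implicits._

val df = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
).toDF("Category", "Count", "Description")

df.show()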

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Add constant column via lit function

Function lit can be used to add columns with a constant value, as the following code snippet shows. Note that it needs to be imported from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.lit

df.withColumn("ConstantColumn1", lit(1)).withColumn("ConstantColumn2", lit(java.time.LocalDate.now)).show()

Two new columns are added. 

Output:

scala> df.withColumn("ConstantColumn1", lit(1)).withColumn("ConstantColumn2", lit(java.time.LocalDate.now)).show()
+----------+-----+------------------+---------------+---------------+
|  Category|Count|       Description|ConstantColumn1|ConstantColumn2|
+----------+-----+------------------+---------------+---------------+
|Category A|  100|This is category A|              1|     2020-12-14|
|Category B|  120|This is category B|              1|     2020-12-14|
|Category C|  150|This is category C|              1|     2020-12-14|
+----------+-----+------------------+---------------+---------------+
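lit only accepts simple literal values. For complex constants such as arrays or maps, Spark provides typedLit; a minimal sketch (the column name ConstantColumn3 is just for illustration):

import org.apache.spark.sql.functions.typedLit

// Adds the same Seq(1, 2, 3) as an array column on every row
df.withColumn("ConstantColumn3", typedLit(Seq(1, 2, 3))).show()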

Other approaches

UDFs or Spark SQL can also be used to add constant values. The following are some examples.

// Add new constant column via Spark SQL
df.createOrReplaceTempView("df")
spark.sql(
    "select *, 1 as ConstantColumn1, current_date as ConstantColumn2 from df").show()

// Add new constant column via UDF
import org.apache.spark.sql.functions.udf
val constantFunc = udf(() => 1)
df.withColumn("ConstantColumn1", constantFunc()).show()

Output:

scala> df.createOrReplaceTempView("df")

scala> spark.sql(
     |     "select *, 1 as ConstantColumn1, current_date as ConstantColumn2 from df").show()
+----------+-----+------------------+---------------+---------------+
|  Category|Count|       Description|ConstantColumn1|ConstantColumn2|
+----------+-----+------------------+---------------+---------------+
|Category A|  100|This is category A|              1|     2020-12-14|
|Category B|  120|This is category B|              1|     2020-12-14|
|Category C|  150|This is category C|              1|     2020-12-14|
+----------+-----+------------------+---------------+---------------+

scala> df.withColumn("ConstantColumn1", constantFunc()).show()
+----------+-----+------------------+---------------+
|  Category|Count|       Description|ConstantColumn1|
+----------+-----+------------------+---------------+
|Category A|  100|This is category A|              1|
|Category B|  120|This is category B|              1|
|Category C|  150|This is category C|              1|
+----------+-----+------------------+---------------+
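A SQL expression can also be embedded directly in the data frame API via the expr function, which avoids registering a temporary view; a minimal sketch:

import org.apache.spark.sql.functions.expr

df.withColumn("ConstantColumn1", expr("1"))
  .withColumn("ConstantColumn2", expr("current_date"))
  .show()

For simple constants, lit is generally preferable to a UDF, since literal values are visible to the Catalyst optimizer while UDFs are opaque to it.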

Run Spark code

You can easily run Spark code on Windows or UNIX-like systems (Linux, macOS). If you don't have a Spark environment yet, follow the Spark setup articles on this site to configure one.

info Last modified by Raymond 2 months ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Follow Kontext

Get our latest updates on LinkedIn or Twitter.

Want to publish your article on Kontext?

Learn more

More from Kontext

visibility 3558
thumb_up 0
access_time 3 years ago

This page shows how to import data from SQL Server into Hadoop via Apache Sqoop. Please follow the link below to install Sqoop in your machine if you don’t have one environment ready. Install Apache Sqoop in Windows Use the following command in Command Prompt, you will be able to find out ...

visibility 2497
thumb_up 0
access_time 6 months ago

Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Python.  The following code snippet creates a DataFrame from a Python native dictionary list. PySpark SQL types are used to create the ...

visibility 26
thumb_up 0
access_time 28 days ago

In my article Connect to Teradata database through Python , I demonstrated about how to use Teradata python package or Teradata ODBC driver to connect to Teradata. In this article, I’m going to show you how to connect to Teradata through JDBC drivers so that you can load data directly into Spark ...