Raymond Raymond | Spark & PySpark

Scala: Remove Columns from Spark Data Frame

2020-12-13

This article shows how to 'remove' columns from a Spark data frame using Scala.

Construct a dataframe 

Follow the article Scala: Convert List to Spark Data Frame to construct a data frame.

The DataFrame object looks like the following: 

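+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+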
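For readers who do not have the referenced article at hand, a data frame with this shape could be built roughly as follows. This is a minimal sketch and not necessarily the code from that article; the application name, the local master setting, and the variable names are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for experimentation.
// Inside spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .appName("drop-columns-example")
  .master("local[*]")
  .getOrCreate()

// Brings toDF and the tuple encoders into scope.
import spark.implicits._

val data = List(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B"),
  ("Category C", 150, "This is category C")
)

// Name the three tuple fields to match the table above.
val df: DataFrame = data.toDF("Category", "Count", "Description")
df.show()
```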

'Delete' or 'Remove' one column

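The words 'delete' and 'remove' can be misleading: a Spark DataFrame is immutable and its transformations are lazily evaluated, so nothing is deleted in place.

We can use the drop function to obtain a new DataFrame without the specified columns; the original DataFrame is left unchanged.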

scala> df.drop("Category").show()
+-----+------------------+
|Count|       Description|
+-----+------------------+
|  100|This is category A|
|  120|This is category B|
|  150|This is category C|
+-----+------------------+
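Two properties of drop are worth seeing in action: it returns a new DataFrame rather than changing the original, and, according to the Spark API documentation, dropping a column name that is not in the schema is a no-op rather than an error. The following is a minimal sketch; the session settings, the sample rows, and the column name "NoSuchColumn" are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .appName("drop-one-column")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B")
).toDF("Category", "Count", "Description")

// drop returns a new DataFrame; df itself keeps all three columns.
val withoutCategory = df.drop("Category")
println(withoutCategory.columns.mkString(", ")) // Count, Description
println(df.columns.mkString(", "))              // Category, Count, Description

// Dropping a name that is not in the schema leaves the columns unchanged.
val unchanged = df.drop("NoSuchColumn")
println(unchanged.columns.mkString(", "))       // Category, Count, Description
```

To keep working with the reduced data frame, assign the result of drop to a new value as shown, since calling drop on df has no effect on df itself.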

Drop multiple columns

Multiple columns can be dropped at the same time:

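// Approach 1: pass a dynamic array of column names as varargs
val columns_to_drop = Array("Category", "Count")
df.drop(columns_to_drop: _*).show()

// Approach 2: list the column names directly
df.drop("Category", "Description").show()

scala> df.drop(columns_to_drop: _*).show()
+------------------+
|       Description|
+------------------+
|This is category A|
|This is category B|
|This is category C|
+------------------+

scala> df.drop("Category", "Description").show()
+-----+
|Count|
+-----+
|  100|
|  120|
|  150|
+-----+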

The code snippets above show two approaches to dropping columns: listing the column names directly, or passing a dynamic array of column names.
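The array-based approach is most useful when the list of columns is computed at runtime. As a sketch, the array below is derived from df.columns by dropping everything that is not in a keep-list; the names keep and columnsToDrop, the sample rows, and the session settings are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .appName("drop-dynamic-columns")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("Category A", 100, "This is category A"),
  ("Category B", 120, "This is category B")
).toDF("Category", "Count", "Description")

// Hypothetical keep-list: every column not named here is dropped.
val keep = Set("Count")
val columnsToDrop: Array[String] = df.columns.filterNot(keep.contains)

// `: _*` expands the array into the varargs that drop(colNames: String*) expects.
val result = df.drop(columnsToDrop: _*)
result.show()
```

When the columns to keep are known up front, selecting them directly is an alternative; the drop form shown here stays closer to the approach described in this article.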

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. If you don't have a Spark environment yet, follow the site's installation articles to set one up.
