Delete or Remove Columns from PySpark DataFrame

This article shows how to 'delete' one or more columns from a Spark DataFrame using Python (PySpark).

Construct a dataframe

Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame. The examples below assume it is named df and has the following content:

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1| 12.40|
|Category B|  2| 30.10|
|Category C|  3|100.01|
+----------+---+------+

'Delete' or 'Remove' one column

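The words 'delete' and 'remove' can be misleading: Spark DataFrames are immutable and lazily evaluated, so nothing is deleted in place.

We can use the drop function to remove or delete columns from a DataFrame. It returns a new DataFrame without the specified columns and leaves the original DataFrame unchanged.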

df1 = df.drop('Category')
df1.show()

Output:

+---+------+
| ID| Value|
+---+------+
|  1| 12.40|
|  2| 30.10|
|  3|100.01|
+---+------+

Drop multiple columns

Multiple columns can be dropped at the same time, either by passing the names as separate arguments or by unpacking a list of names with *:

df2 = df.drop('Category', 'ID')
df2.show()

columns_to_drop = ['Category', 'ID']
df3 = df.drop(*columns_to_drop)
df3.show()

Output (both calls produce the same result):

+------+
| Value|
+------+
| 12.40|
| 30.10|
|100.01|
+------+

+------+
| Value|
+------+
| 12.40|
| 30.10|
|100.01|
+------+

Run Spark code

You can easily run Spark code on Windows or UNIX-like (Linux, macOS) systems. Follow these articles to set up your Spark environment if you don't have one yet: