
Remove Special Characters from Column in PySpark DataFrame

2022-08-19

Code description

Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of special characters, the regular expression can vary. For instance, [^0-9a-zA-Z_\-]+ matches characters that are not alphanumeric, hyphen (-), or underscore (_); the regular expression '[@\+\#\$\%\^\!]+' matches only those explicitly listed special characters.

This code snippet replaces special characters with an empty string.
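As a quick sanity check of the two patterns described above, they can be tried with Python's built-in re module; for these simple character classes the syntax matches the Java regex engine Spark uses (the sample string is the one from the table below):

```python
import re

s = "ABCDEDF!@#$%%^123456qwerty"

# Remove anything that is not alphanumeric, a hyphen, or an underscore
print(re.sub(r"[^0-9a-zA-Z_\-]+", "", s))  # ABCDEDF123456qwerty

# Remove only the explicitly listed special characters
print(re.sub(r"[@\+\#\$\%\^\!]+", "", s))  # ABCDEDF123456qwerty
```

For this particular string both patterns yield the same result, since its only special characters happen to be in the listed set; in general the negated class removes strictly more.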


Input DataFrame:

+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

Output DataFrame:

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regexp_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]

df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Keep only alphanumeric characters, hyphens and underscores
df = df.select("id", regexp_replace(
    "str", r"[^0-9a-zA-Z_\-]+", "").alias("replaced_str"))
df.show()
