Remove Special Characters from Column in PySpark DataFrame

Kontext, 2022-08-19

Code description

The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on how special characters are defined, the regular expression will vary. For instance, [^0-9a-zA-Z_\-]+ matches any run of characters other than letters, digits, hyphen (-) and underscore (_), while '[@\+\#\$\%\^\!]+' matches only runs of the explicitly listed special characters.
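To make the difference between the two patterns concrete, here is a minimal sketch that previews them locally with Python's built-in re module before any Spark job is run. Spark evaluates regexp_replace with Java regular expressions rather than Python's, so this is only an approximation; for simple ASCII character classes like these the two engines should agree. The helper strip_chars, the constant names and the third sample string are hypothetical and not part of the original snippet.

```python
import re

# Allowlist: match anything that is NOT a letter, digit, underscore or hyphen.
ALLOWLIST_PATTERN = r"[^0-9a-zA-Z_\-]+"
# Denylist: match only the special characters listed explicitly.
DENYLIST_PATTERN = r"[@\+\#\$\%\^\!]+"


def strip_chars(pattern: str, text: str) -> str:
    """Hypothetical helper: delete every match of `pattern` from `text`."""
    return re.sub(pattern, "", text)


# The first two samples come from the article; the third is an assumed
# example that contains a space and an ampersand.
samples = ["ABCDEDF!@#$%%^123456qwerty", "ABCDE!!!", "a-b_c d&e"]

for s in samples:
    allow = strip_chars(ALLOWLIST_PATTERN, s)
    deny = strip_chars(DENYLIST_PATTERN, s)
    print(f"{s!r}: allowlist -> {allow!r}, denylist -> {deny!r}")
```

Both patterns give the same result for the two strings used in this article. They differ on the third sample: the allowlist pattern also strips the space and the ampersand, while the denylist pattern leaves them alone because they are not in its list.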

The PySpark snippet under "Code snippet" below replaces special characters with an empty string and prints the DataFrame before and after the replacement.

Output:

+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regexp_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']
        ]

df = spark.createDataFrame(data, ['id', 'str'])

df.show(truncate=False)

# Raw string avoids Python's invalid-escape warning for \- in the pattern.
df = df.select("id",
               regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "").alias('replaced_str'))

df.show()