Code description
The Spark SQL function regexp_replace
can be used to remove special characters from a string column in a Spark DataFrame. Depending on how special characters are defined, the regular expression can vary. For instance, [^0-9a-zA-Z_\-]+
matches any run of characters that are not alphanumeric, hyphen (-) or underscore (_); the regular expression '[@\+\#\$\%\^\!]+
' matches only those explicitly listed special characters.
This code snippet replaces the matched special characters with an empty string.
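regexp_replace uses Java regular expression syntax, but the two character-class patterns above behave the same way in Python's re module, so you can sanity-check them locally without starting a Spark session. A minimal sketch (plain Python, no Spark required):

import re

s = "ABCDEDF!@#$%%^123456qwerty"

# Negated class: strip everything that is NOT alphanumeric, hyphen or underscore.
print(re.sub(r"[^0-9a-zA-Z_\-]+", "", s))   # ABCDEDF123456qwerty

# Explicit class: strip only the listed special characters.
print(re.sub(r"[@\+\#\$\%\^\!]+", "", s))   # ABCDEDF123456qwerty

For this sample string both patterns give the same result, because every special character in it happens to appear in the explicit class; on other inputs the negated class is the stricter of the two.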
Output:
+---+--------------------------+
|id |str |
+---+--------------------------+
|1 |ABCDEDF!@#$%%^123456qwerty|
|2 |ABCDE!!! |
+---+--------------------------+
+---+-------------------+
| id| replaced_str|
+---+-------------------+
| 1|ABCDEDF123456qwerty|
| 2| ABCDE|
+---+-------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
app_name = "PySpark regexp_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]

df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Strip every run of characters that are not alphanumeric, hyphen or underscore.
df = df.select("id", regexp_replace(
    "str", r"[^0-9a-zA-Z_\-]+", "").alias('replaced_str'))
df.show()