Remove Special Characters from Column in PySpark DataFrame
Code description
The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on how special characters are defined, the regular expression will vary. For instance, [^0-9a-zA-Z_\-]+ matches any character that is not alphanumeric, a hyphen (-), or an underscore (_); the regular expression [@\+\#\$\%\^\!]+ matches only those explicitly listed special characters.
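Since regexp_replace follows Java regular-expression syntax, which overlaps with Python's for these simple character classes, the two patterns above can be sanity-checked quickly with Python's built-in re module before running them on a cluster (a quick illustration, not part of the Spark job itself):

```python
import re

# Sample strings containing special characters
samples = ["ABCDEDF!@#$%%^123456qwerty", "ABCDE!!!"]

# Negated class: remove everything that is NOT alphanumeric, hyphen, or underscore
keep_pattern = r"[^0-9a-zA-Z_\-]+"
# Explicit class: remove ONLY the listed special characters
strip_pattern = r"[@+#$%^!]+"

for s in samples:
    print(re.sub(keep_pattern, "", s))   # e.g. ABCDEDF123456qwerty
    print(re.sub(strip_pattern, "", s))  # same result for these samples
```

For these sample strings both patterns produce the same output; they differ once the input contains characters (such as spaces or periods) that appear in neither the allowed set nor the explicit strip list.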
The code snippet below replaces the special characters with an empty string.
Output:
+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regexp_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]
df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Use a raw string for the pattern to avoid invalid-escape warnings
df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "")
               .alias('replaced_str'))
df.show()