from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regexp_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]
df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Remove any character that is not alphanumeric, underscore or hyphen.
# A raw string (r"...") keeps \- from being interpreted as a Python escape.
df = df.select("id", regexp_replace(
    "str", r"[^0-9a-zA-Z_\-]+", "").alias('replaced_str'))
df.show()
Remove Special Characters from Column in PySpark DataFrame
Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of special characters, the regular expression can vary. For instance, [^0-9a-zA-Z_\-]+ matches any character that is not alphanumeric, a hyphen (-) or an underscore (_); the regular expression '[@\+\#\$\%\^\!]+' matches only those explicitly listed special characters.
This code snippet replaces the matched special characters with an empty string.
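The two character classes above can be checked quickly with Python's built-in re module before running them through Spark; for simple character classes like these, Python's regex dialect and the Java dialect that Spark uses behave the same. This is a standalone sketch, not part of the Spark job:

```python
import re

# Negated class: strip everything except alphanumerics, underscore and hyphen.
keep_word_chars = r"[^0-9a-zA-Z_\-]+"
print(re.sub(keep_word_chars, "", "ABCDEDF!@#$%%^123456qwerty"))  # ABCDEDF123456qwerty

# Explicit class: strip only the listed special characters.
listed_specials = r"[@\+\#\$\%\^\!]+"
print(re.sub(listed_specials, "", "ABCDE!!!"))  # ABCDE
```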
Output:
+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+