Remove Special Characters from Column in PySpark DataFrame
Code description
The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on how special characters are defined, the regular expression will vary. For instance, [^0-9a-zA-Z_\-]+ matches any run of characters that are not alphanumeric, hyphen (-), or underscore (_), while the regular expression '[@\+\#\$\%\^\!]+' matches only the special characters listed inside the character class.
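As an illustration of the second pattern, the following minimal sketch removes only the listed special characters and keeps everything else; it assumes an existing DataFrame df with a string column named str (both hypothetical here):

from pyspark.sql.functions import regexp_replace

# Strip only the explicitly listed special characters; all other
# characters (including other punctuation) are left untouched.
# `df` and its column `str` are assumed to exist already.
df = df.withColumn("cleaned_str", regexp_replace("str", r"[@\+\#\$\%\^\!]+", ""))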
The full code snippet below replaces these special characters with an empty string.
Output:
+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regexp_replace Example"
master = "local"

# Create a local SparkSession.
spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]
df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Replace every character that is not alphanumeric, hyphen, or underscore
# with an empty string. A raw string avoids the invalid escape sequence
# warning that "\-" triggers in a normal Python string literal.
df = df.select("id",
               regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "").alias('replaced_str'))
df.show()
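The same cleanup can also be written through the SQL expression API. A brief sketch, assuming the df created above; the hyphen is placed at the end of the character class so no backslash escaping is needed inside the SQL string literal:

# Equivalent cleanup via selectExpr on the DataFrame created above.
df.selectExpr(
    "id",
    "regexp_replace(str, '[^0-9a-zA-Z_-]+', '') AS replaced_str"
).show()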