Replace Values via regexp_replace Function in PySpark DataFrame

Kontext, 2022-08-16

Code description

The PySpark SQL API provides the built-in regexp_replace function, which replaces every substring of a string column that matches a specified regular expression.

It takes three parameters: the input column of the DataFrame, the regular expression pattern, and the replacement string used for each match.

pyspark.sql.functions.regexp_replace(str, pattern, replacement)
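
To see how the three arguments interact before running a Spark job, the same idea can be previewed on plain Python strings. The sketch below uses the standard-library re.sub, not Spark, and preview_replace is a hypothetical helper introduced only for illustration. Spark evaluates patterns with Java regular expressions, so the two engines agree on a simple character class like this one but differ in some details (for example, Java refers to capture groups in the replacement as $1, while Python uses \1).

```python
import re


def preview_replace(values, pattern, replacement):
    """Apply re.sub to each string; a rough local stand-in for regexp_replace."""
    return [re.sub(pattern, replacement, v) for v in values]


# Same sample values as the DataFrame in this article.
rows = ["Hello Kontext!", "Hello Context!"]
previewed = preview_replace(rows, r"[CK]", "k")
print(previewed)
```

This is only a quick local check of the pattern; the Spark function operates on a column and returns a Column expression.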

Output

The code snippet below produces the following output:

+--------------+-------+----------------+
|       str_col|int_col|str_col_replaced|
+--------------+-------+----------------+
|Hello Kontext!|    100|  Hello kontext!|
|Hello Context!|    100|  Hello kontext!|
+--------------+-------+----------------+

Every uppercase 'K' or 'C' in str_col is replaced with a lowercase 'k'; all other characters are left unchanged.
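
Inside square brackets, a regular expression lists single characters, so a | placed there is a literal pipe and not an "or" operator. The plain-Python check below illustrates this with the standard-library re module, used here as a stand-in for Spark's Java regex engine, which treats this case the same way. The sample string is hypothetical and chosen only because it contains a pipe.

```python
import re

# Hypothetical input that contains a literal pipe character.
sample = "Context | Kontext"

# [CK] matches only the letters C and K.
only_letters = re.sub(r"[CK]", "k", sample)

# [C|K] matches C, K, and also the pipe itself.
with_pipe = re.sub(r"[C|K]", "k", sample)

print(only_letters)  # kontext | kontext
print(with_pipe)     # kontext k kontext
```

For the sample rows in this article both patterns give the same result, since neither value contains a pipe, but [CK] states the intent more precisely.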

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regex sql functions"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# Create a DataFrame
df = spark.createDataFrame(
    [['Hello Kontext!', 100], ['Hello Context!', 100]], ['str_col', 'int_col'])

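# Add str_col_replaced: replace every uppercase 'C' or 'K' in str_col with 'k'.
# [CK] is a character class; a '|' inside the brackets would match a literal pipe.
df = df.withColumn('str_col_replaced',
                   regexp_replace('str_col', r'[CK]', 'k'))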

df.show()