Replace Values via regexp_replace Function in PySpark DataFrame

Code description

PySpark SQL provides the built-in regexp_replace function, which replaces every substring of a string value that matches the specified regular expression.

It takes three parameters: the input column of the DataFrame, the regular expression pattern, and the replacement for each match.

    pyspark.sql.functions.regexp_replace(str, pattern, replacement)

Output

The following is the output from this code snippet:

    +--------------+-------+----------------+
    |       str_col|int_col|str_col_replaced|
    +--------------+-------+----------------+
    |Hello Kontext!|    100|  Hello kontext!|
    |Hello Context!|    100|  Hello kontext!|
    +--------------+-------+----------------+

Every uppercase 'K' or 'C' in str_col is replaced with a lowercase 'k'.

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace
    
    app_name = "PySpark regex sql functions"
    master = "local"
    
    spark = (SparkSession.builder
             .appName(app_name)
             .master(master)
             .getOrCreate())
    
    spark.sparkContext.setLogLevel("WARN")
    
    # Create a DataFrame
    df = spark.createDataFrame(
        [['Hello Kontext!', 100], ['Hello Context!', 100]], ['str_col', 'int_col'])
    
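    # Replace every uppercase 'C' or 'K' in str_col with a lowercase 'k'.
    # A character class needs no '|' separator: inside [...], '|' is a literal.
    df = df.withColumn('str_col_replaced',
                       regexp_replace('str_col', r'[CK]', 'k'))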
    
    df.show()
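
A note on the pattern: inside square brackets, `|` is a literal character rather than an alternation operator, so a class written as `[C|K]` would also match a pipe character. The sketch below illustrates the difference using Python's built-in `re` module as a stand-in. That is an assumption made for illustration only: Spark evaluates patterns with Java's regex engine, although character classes behave the same way in this case. The third sample string is hypothetical and was added only to expose the difference.

```python
import re

# Plain-Python illustration only; not a Spark call.
# The last sample is a made-up value containing a pipe character.
samples = ["Hello Kontext!", "Hello Context!", "C|K"]

# '[C|K]' matches 'C', 'K' and the literal '|'.
with_pipe = [re.sub(r"[C|K]", "k", s) for s in samples]

# '[CK]' matches only 'C' and 'K'.
without_pipe = [re.sub(r"[CK]", "k", s) for s in samples]

print(with_pipe)     # ['Hello kontext!', 'Hello kontext!', 'kkk']
print(without_pipe)  # ['Hello kontext!', 'Hello kontext!', 'k|k']
```

For the two rows in the example DataFrame both patterns give the same result, since neither value contains a pipe.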
