Iterate through PySpark DataFrame Rows via foreach

Code description

DataFrame.foreach applies a function to each row (pyspark.sql.types.Row) of a Spark DataFrame. The function is called once per row for its side effects, and foreach itself returns nothing. This method is a shorthand for DataFrame.rdd.foreach.

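Note: Be cautious when using this method, especially if your DataFrame is big. foreach is an action, so it triggers evaluation of the whole DataFrame, and the function runs on the executors rather than on the driver. On a cluster, anything the function prints ends up in the executor logs, not in the driver console; the output shown here appears in the console because the example runs with a local master.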

Output:

    +-----+--------+
    | col1|    col2|
    +-----+--------+
    |Hello| Kontext|
    |Hello|Big Data|
    +-----+--------+
    
    col1=Hello, col2=Kontext
    col1=Hello, col2=Big Data

Code snippet

    from pyspark.sql import SparkSession
    
    app_name = "PySpark foreach Example"
    master = "local"
    
    spark = (SparkSession.builder
             .appName(app_name)
             .master(master)
             .getOrCreate())
    
    spark.sparkContext.setLogLevel("WARN")
    
    # Create a DataFrame
    df = spark.createDataFrame(
        [['Hello', 'Kontext'], ['Hello', 'Big Data']], ['col1', 'col2'])
    
    df.show()
    
    
    def print_row(row):
        print(f'col1={row.col1}, col2={row.col2}')
    
    
    # Apply print_row to each row
    df.foreach(print_row)
