Iterate through PySpark DataFrame Rows via foreach

Kontext, 2022-08-19

Code description

DataFrame.foreach applies a function to each row (pyspark.sql.types.Row) of a Spark DataFrame. It is a shorthand for DataFrame.rdd.foreach. The function is called for its side effects only: any value it returns is discarded, and foreach itself returns None.

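Note: Use this method with care, especially on a large DataFrame. The function runs on the executors rather than the driver and is invoked once per row in Python, which can be slow. In local mode the printed lines appear in the console; on a cluster they typically end up in the executor logs instead. The order in which rows are processed is not guaranteed.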

Output:

+-----+--------+
| col1|    col2|
+-----+--------+
|Hello| Kontext|
|Hello|Big Data|
+-----+--------+

col1=Hello, col2=Kontext
col1=Hello, col2=Big Data

Code snippet

from pyspark.sql import SparkSession

app_name = "PySpark foreach Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# Create a DataFrame
df = spark.createDataFrame(
    [['Hello', 'Kontext'], ['Hello', 'Big Data']], ['col1', 'col2'])

df.show()


def print_row(row):
    print(f'col1={row.col1}, col2={row.col2}')


# Apply print_row to each row
df.foreach(print_row)