Iterate through PySpark DataFrame Rows via foreach
Code description
DataFrame.foreach can be used to iterate through each row (pyspark.sql.types.Row) in a Spark DataFrame and apply a function to every row. This method is a shorthand for DataFrame.rdd.foreach.
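Because the function runs on the executors rather than the driver, side effects such as incrementing a local Python variable are not visible back on the driver. Below is a minimal sketch (the names row_count and count_row are illustrative) that uses an accumulator to count rows, and also shows the rdd.foreach equivalence:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(
    [['Hello', 'Kontext'], ['Hello', 'Big Data']],
    ['col1', 'col2'])

# An accumulator is needed because count_row runs on the executors;
# incrementing a plain Python integer would not be reflected on the driver.
row_count = spark.sparkContext.accumulator(0)

def count_row(row):
    row_count.add(1)

df.foreach(count_row)       # shorthand for the line below
df.rdd.foreach(count_row)   # equivalent call via the underlying RDD

print(row_count.value)  # 4, since each of the two calls visited 2 rows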
Note: the function passed to foreach runs on the executors, not the driver, so in cluster mode any printed output appears in the executor logs rather than the driver console. Please be cautious when using this method, especially if your DataFrame is big, since the function is invoked once for every row.
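If the goal is simply to walk through rows on the driver, DataFrame.toLocalIterator is a gentler alternative for large DataFrames: it streams one partition at a time to the driver instead of shipping a function to the executors. A minimal sketch, assuming df is the DataFrame created in the code snippet below:

# toLocalIterator pulls one partition at a time to the driver,
# so only a single partition needs to fit in driver memory.
for row in df.toLocalIterator():
    print(f'col1={row.col1}, col2={row.col2}')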
Output:
+-----+--------+
| col1|    col2|
+-----+--------+
|Hello| Kontext|
|Hello|Big Data|
+-----+--------+

col1=Hello, col2=Kontext
col1=Hello, col2=Big Data
Code snippet
from pyspark.sql import SparkSession

app_name = "PySpark foreach Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# Create a DataFrame
df = spark.createDataFrame(
    [['Hello', 'Kontext'], ['Hello', 'Big Data']],
    ['col1', 'col2'])
df.show()

def print_row(row):
    print(f'col1={row.col1}, col2={row.col2}')

# Apply print_row to each row
df.foreach(print_row)
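When each row would trigger expensive setup, for example opening a database connection, DataFrame.foreachPartition is the usual companion: the function receives an iterator over all rows in one partition, so the setup can happen once per partition instead of once per row. A minimal sketch continuing from the snippet above (print_partition is an illustrative name):

def print_partition(rows):
    # rows is an iterator of pyspark.sql.types.Row objects for one partition;
    # per-partition setup (e.g. opening a connection) would go here
    for row in rows:
        print(f'col1={row.col1}, col2={row.col2}')

df.foreachPartition(print_partition)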