PySpark DataFrame - Filter Records using where and filter Function

2022-07-18

Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise. 

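In a Spark DataFrame, we can use where or filter to keep only the records that match a condition. The where method is an alias for filter. A filter condition can be written either as a SQL expression string or as a Column expression (optionally built with Spark SQL functions) that evaluates to true or false. Multiple conditions can be combined in one filter with & (AND) or | (OR); wrap each condition in parentheses, because in Python these operators bind more tightly than comparison operators.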

Code snippet

The following script shows how to use filters. 

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - where or filter"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel('WARN')

data = [{"a": "100", "b": "200"},
        {"a": "1000", "b": "2000"}]

df = spark.createDataFrame(data)
df.show()

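# Columns a and b hold strings; Spark casts them implicitly for the numeric comparisons below
# Condition as a SQL expression string
df.where("a > 100").show()
# Condition as a Column expression
df.filter(df.a > 100).show()
# Two conditions combined with & (AND); each one needs its own parentheses
df.where((df.b > 100) & (df.a == 1000)).show()
df.where((F.col('b') > 100) & (F.col('a') == 1000)).show()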

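Output:

The first df.show() call prints the full DataFrame:

+----+----+
|   a|   b|
+----+----+
| 100| 200|
|1000|2000|
+----+----+

Each of the four filters returns the same result, the single row where a is 1000:

+----+----+
|   a|   b|
+----+----+
|1000|2000|
+----+----+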

Filter out null or None values

Refer to Filter Spark DataFrame Columns with None or Null Values.
