PySpark DataFrame - Filter Records using where and filter Function
In a Spark DataFrame, we can use where or filter to filter out unwanted records. Method where is an alias for filter. Filter conditions can be written either in SQL style (as a string) or as expressions built with Spark SQL functions that evaluate to true or false. To combine multiple conditions in one filter, use & (AND) or | (OR), wrapping each condition in parentheses because of Python operator precedence.
Code snippet
The following script shows how to use filters.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - where or filter"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

data = [{"a": "100", "b": "200"}, {"a": "1000", "b": "2000"}]
df = spark.createDataFrame(data)
df.show()

df.where("a > 100").show()
df.filter(df.a > 100).show()
df.where((df.b > 100) & (df.a == 1000)).show()
df.where((F.col('b') > 100) & (F.col('a') == 1000)).show()
Output:
+----+----+
|   a|   b|
+----+----+
| 100| 200|
|1000|2000|
+----+----+
The four filters above all produce the same result: only the row where a is 1000 and b is 2000.
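The script above only combines conditions with &. As an additional illustration (a minimal sketch reusing the df and the F alias from the script above), | expresses OR and ~ negates a condition; each condition still needs its own parentheses:

# OR: keep rows where a equals 100 or b is greater than 1000
df.where((F.col("a") == 100) | (F.col("b") > 1000)).show()

# NOT: keep rows where a is not greater than 100
df.filter(~(df.a > 100)).show()

# Equivalent SQL-style conditions
df.where("a = 100 OR b > 1000").show()
df.where("NOT (a > 100)").show()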
Filter out null or None values
Refer to Filter Spark DataFrame Columns with None or Null Values.
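As a quick sketch of that approach (the sample DataFrame and column names here are illustrative, not taken from the linked article), isNull and isNotNull on a Column, or the SQL forms IS NULL / IS NOT NULL, work inside the same where and filter calls:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("null-filter-sketch").getOrCreate()

# Illustrative data with a None value in column b
df_nulls = spark.createDataFrame([("100", "200"), ("1000", None)], ["a", "b"])

# Keep rows where b is null
df_nulls.where(F.col("b").isNull()).show()

# Keep rows where b is not null (SQL style)
df_nulls.filter("b IS NOT NULL").show()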