Spark - Check if Array Column Contains Specific Value

Spark DataFrames support complex data types such as arrays. This code snippet shows how to check whether a specific value exists in an array column using the array_contains function.

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

appName = "PySpark Example - array_contains"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, ['apple', 'pear', 'kiwi']), (2, ['apple']), (3, ['pear', 'berry'])]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Tags", ArrayType(StringType()), True)])

# Create Spark DataFrame from the in-memory list
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

# Show only the records whose Tags array contains 'apple'
df.where(array_contains('Tags', 'apple')).show()

# Show only the records whose Tags array does not contain 'apple'
df.where(~array_contains('Tags', 'apple')).show()

spark.stop()

The code snippet constructs a Spark DataFrame using data in memory. The schema looks like the following:

StructType(List(StructField(ID,IntegerType,true),StructField(Tags,ArrayType(StringType,true),true)))

The output:

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
|  3|      [pear, berry]|
+---+-------------------+

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
+---+-------------------+

+---+-------------+
| ID|         Tags|
+---+-------------+
|  3|[pear, berry]|
+---+-------------+

The first result shows the full DataFrame. The second shows the records with the word 'apple' in the Tags array column; the third shows the records without it.
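
One detail the sample data does not exercise is a null array. According to the Spark documentation, array_contains returns null when the array itself is null, and where keeps a row only when the condition evaluates to true. The following plain-Python sketch models that behavior without starting a Spark session. The helper names (array_contains_model, where_model, negate) and the extra row with ID 4 are hypothetical, added only for illustration; None stands in for SQL NULL.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, Optional[List[str]]]


def array_contains_model(tags: Optional[List[str]], value: str) -> Optional[bool]:
    """Plain-Python stand-in for array_contains; None models SQL NULL."""
    if tags is None:
        return None  # a null array gives a null result
    return value in tags


def negate(result: Optional[bool]) -> Optional[bool]:
    """Three-valued NOT: NULL stays NULL."""
    return None if result is None else not result


def where_model(rows: List[Row], predicate: Callable[[Row], Optional[bool]]) -> List[Row]:
    """Model of DataFrame.where: keep a row only when the predicate is True."""
    return [row for row in rows if predicate(row) is True]


# Same sample data as above, plus a hypothetical row whose Tags value is null
data: List[Row] = [
    (1, ['apple', 'pear', 'kiwi']),
    (2, ['apple']),
    (3, ['pear', 'berry']),
    (4, None),
]

with_apple = where_model(data, lambda r: array_contains_model(r[1], 'apple'))
without_apple = where_model(data, lambda r: negate(array_contains_model(r[1], 'apple')))

print([r[0] for r in with_apple])     # [1, 2]
print([r[0] for r in without_apple])  # [3]
```

In this model the row with ID 4 appears in neither result, because the condition is null for it in both the positive and the negated filter. If rows with a null array should count as "not containing" the value, the condition would need explicit null handling.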
