Raymond

Spark - Check if Array Column Contains Specific Value

2021-05-22

Spark DataFrames support complex data types such as arrays. This code snippet shows how to check whether a specific value exists in an array column using the array_contains function.

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

appName = "PySpark Example - array_contains"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, ['apple', 'pear', 'kiwi']), (2, ['apple']), (3, ['pear', 'berry'])]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Tags", ArrayType(StringType()), True)])

# Create Spark DataFrame from the in-memory list using the explicit schema
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

# Show only the records whose Tags column contains 'apple'
df.where(array_contains('Tags', 'apple')).show()

# Show only the records whose Tags column does not contain 'apple'
df.where(~array_contains('Tags', 'apple')).show()

spark.stop()

The code snippet constructs a Spark DataFrame from in-memory data. The printed schema looks like the following:

StructType(List(StructField(ID,IntegerType,true),StructField(Tags,ArrayType(StringType,true),true)))

The output:

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
|  3|      [pear, berry]|
+---+-------------------+

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
+---+-------------------+

+---+-------------+
| ID|         Tags|
+---+-------------+
|  3|[pear, berry]|
+---+-------------+

The first table shows the full DataFrame. The second shows only the records whose Tags array contains 'apple'; the third shows the records whose Tags array does not.

References

Spark SQL - Array Functions - Kontext
