Spark - Check if Array Column Contains Specific Value
Spark DataFrames support complex data types such as arrays. This code snippet shows one way to check whether a specific value exists in an array column, using the array_contains function.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

appName = "PySpark Example - array_contains"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, ['apple', 'pear', 'kiwi']),
        (2, ['apple']),
        (3, ['pear', 'berry'])]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Tags", ArrayType(StringType()), True)])

# Create Spark DataFrame from the in-memory list
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

# Show only the records that contain 'apple' in the Tags column
df.where(array_contains('Tags', 'apple')).show()

# Show only the records that do not contain 'apple' in the Tags column
df.where(~array_contains('Tags', 'apple')).show()

spark.stop()
The code snippet constructs a Spark DataFrame using data in memory. The schema looks like the following:
StructType(List(StructField(ID,IntegerType,true),StructField(Tags,ArrayType(StringType,true),true)))
The output:
+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
|  3|      [pear, berry]|
+---+-------------------+

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
+---+-------------------+

+---+-------------+
| ID|         Tags|
+---+-------------+
|  3|[pear, berry]|
+---+-------------+
The second result shows the records whose Tags array contains 'apple'; the third shows the records whose Tags array does not.
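Because the Tags column is declared nullable, it can help to see what the two filters do with a row whose array is missing. The sketch below is a plain-Python model of that behaviour, not Spark code. It assumes the documented rule that array_contains returns null for a null array, and that a where clause keeps only rows whose condition is true. The helper names (array_contains_model, negate, where_model) and the extra row with ID 4 are hypothetical and were added for illustration.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, Optional[List[str]]]


def array_contains_model(tags: Optional[List[str]], value: str) -> Optional[bool]:
    """Hypothetical stand-in for array_contains: None (null) when the array is null."""
    if tags is None:
        return None
    return value in tags


def negate(result: Optional[bool]) -> Optional[bool]:
    """SQL-style NOT: a null result stays null."""
    return None if result is None else not result


def where_model(rows: List[Row], predicate: Callable[[Row], Optional[bool]]) -> List[Row]:
    """Keep only the rows whose predicate is exactly True, as a where filter does."""
    return [row for row in rows if predicate(row) is True]


# The three sample rows from the snippet, plus a hypothetical row with a null array.
rows: List[Row] = [
    (1, ['apple', 'pear', 'kiwi']),
    (2, ['apple']),
    (3, ['pear', 'berry']),
    (4, None),
]

with_apple = where_model(rows, lambda r: array_contains_model(r[1], 'apple'))
without_apple = where_model(rows, lambda r: negate(array_contains_model(r[1], 'apple')))

print([r[0] for r in with_apple])     # [1, 2]
print([r[0] for r in without_apple])  # [3]
```

Under these assumptions the row with ID 4 appears in neither result, so the two filters are not exact complements once null arrays are present. The model does not cover arrays that themselves contain null elements.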
References
Spark SQL - Array Functions - Kontext