Spark DataFrames support complex data types such as arrays. This code snippet shows how to check whether a specific value exists in an array column using the array_contains function.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

appName = "PySpark Example - array_contains"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, ['apple', 'pear', 'kiwi']),
        (2, ['apple']),
        (3, ['pear', 'berry'])]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Tags", ArrayType(StringType()), True)])

# Create a Spark DataFrame from the in-memory data
df = spark.createDataFrame(data, schema)
print(df.schema)
df.show()

# Show only the records whose Tags column contains 'apple'
df.where(array_contains('Tags', 'apple')).show()

# Show only the records whose Tags column does not contain 'apple'
df.where(array_contains('Tags', 'apple') == False).show()

spark.stop()
The code snippet constructs a Spark DataFrame using data in memory. The schema looks like the following:
StructType(List(StructField(ID,IntegerType,true),StructField(Tags,ArrayType(StringType,true),true)))
The output:
+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
|  3|      [pear, berry]|
+---+-------------------+

+---+-------------------+
| ID|               Tags|
+---+-------------------+
|  1|[apple, pear, kiwi]|
|  2|            [apple]|
+---+-------------------+

+---+-------------+
| ID|         Tags|
+---+-------------+
|  3|[pear, berry]|
+---+-------------+
The second table shows the records whose Tags array column contains the word 'apple'; the third shows the records that do not.