PySpark split and explode example

Kontext Kontext event 2023-08-06 visibility 771
more_vert

Code description

This code snippet shows you how to define a function to split a string column to an array of strings using Python built-in split function. It then explodes the array element from the split into using PySpark built-in explode function.


Sample output

+----------+-----------------+--------------------+-----+
|  category|            users|         users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B|      user3,user4|      [user3, user4]|user3|
|Category B|      user3,user4|      [user3, user4]|user4|
+----------+-----------------+--------------------+-----+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import udf, explode

appName = "PySpark Example - split"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{
    'category': 'Category A',
    'users': 'user1,user2,user3'
}, {
    'category': 'Category B',
    'users': 'user3,user4'
}]

# Create DF
df = spark.createDataFrame(data)
df.show()


@udf(ArrayType(StringType(), False))
def udf_split(text: str, separator: str = ','):
    """
    Spit udf
    """
    return text.split(sep=separator)


df = df.withColumn('users_array', udf_split(df['users']))
df.show()

# Explode - convert array to rows

df  = df.withColumn('user', explode(df['users_array']))
df.show()
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts