PySpark split and explode example
Code description
This code snippet shows how to define a user-defined function (UDF) that splits a string column into an array of strings using Python's built-in split method. It then uses PySpark's built-in explode function to convert each element of the array into a separate row.
Sample output
+----------+-----------------+--------------------+-----+
| category| users| users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B| user3,user4| [user3, user4]|user3|
|Category B| user3,user4| [user3, user4]|user4|
+----------+-----------------+--------------------+-----+
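The row multiplication in the sample output (three rows for Category A, two for Category B) can be sketched in plain Python without Spark; this is only an illustration of what split followed by explode does to each row, not PySpark code:

```python
# Plain-Python sketch of the split + explode semantics shown above.
rows = [
    {"category": "Category A", "users": "user1,user2,user3"},
    {"category": "Category B", "users": "user3,user4"},
]

exploded = []
for row in rows:
    users_array = row["users"].split(",")  # split: string -> array
    for user in users_array:               # explode: one output row per element
        exploded.append({**row, "users_array": users_array, "user": user})

# 3 rows for Category A + 2 rows for Category B = 5 rows in total
```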
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, explode

appName = "PySpark Example - split"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{
    'category': 'Category A',
    'users': 'user1,user2,user3'
}, {
    'category': 'Category B',
    'users': 'user3,user4'
}]

# Create DataFrame
df = spark.createDataFrame(data)
df.show()


@udf(ArrayType(StringType(), False))
def udf_split(text: str, separator: str = ','):
    """Split UDF"""
    return text.split(sep=separator)


df = df.withColumn('users_array', udf_split(df['users']))
df.show()

# Explode - convert array elements to rows
df = df.withColumn('user', explode(df['users_array']))
df.show()