In PySpark, we can use the explode function to explode an array or a map column into multiple rows: each array element or map entry becomes its own row, so the DataFrame ends up with more rows after exploding.
Code snippet
The following code snippet explodes an array column.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - explode function"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

# Sample data: each record has an array column named 'values'
data = [{"values": [1, 2, 3, 4, 5]}, {"values": [6, 7, 8]}]
df = spark.createDataFrame(data)
df.show()

# Explode the array: one output row per array element
df.withColumn('value', F.explode(df['values'])).show()
Each element of the array becomes a separate row, with the element stored in the new value column:
+---------------+-----+
| values|value|
+---------------+-----+
|[1, 2, 3, 4, 5]| 1|
|[1, 2, 3, 4, 5]| 2|
|[1, 2, 3, 4, 5]| 3|
|[1, 2, 3, 4, 5]| 4|
|[1, 2, 3, 4, 5]| 5|
| [6, 7, 8]| 6|
| [6, 7, 8]| 7|
| [6, 7, 8]| 8|
+---------------+-----+
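One detail worth noting: explode drops input rows whose array is null or empty. If those rows need to be preserved, PySpark also provides explode_outer, which emits a single row with a null value instead. A minimal sketch, assuming the same Spark session as above and a small hypothetical sample:

# Hypothetical sample: the second record has an empty array
data2 = [{"values": [1, 2]}, {"values": []}]
df2 = spark.createDataFrame(data2)
# explode would drop the empty-array row; explode_outer keeps it with value = null
df2.withColumn('value', F.explode_outer(df2['values'])).show()

With explode_outer, the empty-array record still appears once in the output, with null in the value column.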
We can also use the explode function on a map column; each key-value pair becomes one row with key and value columns.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - explode function"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

# Sample data: each record has a map column named 'values'
data = [{"values": {"a": "100", "b": "200"}},
        {"values": {"a": "1000", "b": "2000"}}]
df = spark.createDataFrame(data)
df.show()

# Explode the map: one output row per key-value pair;
# alias('key', 'value') names the two generated columns
df = df.select("*", F.explode(df['values']).alias('key', 'value'))
df.show()
The output includes one row for each key-value pair in each map, as the following shows:
+--------------------+
| values|
+--------------------+
|[a -> 100, b -> 200]|
|[a -> 1000, b -> ...|
+--------------------+
+--------------------+---+-----+
| values|key|value|
+--------------------+---+-----+
|[a -> 100, b -> 200]| a| 100|
|[a -> 100, b -> 200]| b| 200|
|[a -> 1000, b -> ...| a| 1000|
|[a -> 1000, b -> ...| b| 2000|
+--------------------+---+-----+
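The same caveat applies to maps: rows whose map is null or empty are dropped by explode, while explode_outer keeps them as a single row with null key and value. A minimal sketch with a hypothetical sample, this time selecting only the generated pair instead of keeping the original map column:

# Hypothetical sample: the second record has an empty map
data2 = [{"values": {"a": "100"}}, {"values": {}}]
df2 = spark.createDataFrame(data2)
# Select just the generated columns; explode_outer keeps the empty-map row
df2.select(F.explode_outer(df2['values']).alias('key', 'value')).show()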