PySpark DataFrame - explode Array and Map Columns
In PySpark, we can use the explode function to explode an array or a map column. After exploding, the DataFrame will end up with more rows.
Code snippet
The following code snippet explodes an array column.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - explode function"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

data = [{"values": [1, 2, 3, 4, 5]}, {"values": [6, 7, 8]}]
df = spark.createDataFrame(data)
df.show()

df.withColumn('value', F.explode(df['values'])).show()
Each element of the array becomes a separate row, stored in the new value column:
+---------------+-----+
|         values|value|
+---------------+-----+
|[1, 2, 3, 4, 5]|    1|
|[1, 2, 3, 4, 5]|    2|
|[1, 2, 3, 4, 5]|    3|
|[1, 2, 3, 4, 5]|    4|
|[1, 2, 3, 4, 5]|    5|
|      [6, 7, 8]|    6|
|      [6, 7, 8]|    7|
|      [6, 7, 8]|    8|
+---------------+-----+
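Note that explode skips rows whose array is null or empty. If those rows need to be preserved, explode_outer emits them with a null value instead. Below is a minimal sketch, assuming a hypothetical DataFrame df_nulls that contains such rows (reusing the Spark session created above):

# Hypothetical data with an empty array and a null to illustrate the difference
data = [{"values": [1, 2]}, {"values": []}, {"values": None}]
df_nulls = spark.createDataFrame(data)

# explode drops the rows with empty or null arrays ...
df_nulls.withColumn('value', F.explode(df_nulls['values'])).show()

# ... while explode_outer keeps them with value = null
df_nulls.withColumn('value', F.explode_outer(df_nulls['values'])).show()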
For a map column, we can also use the explode function.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

appName = "PySpark DataFrame - explode function"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')

data = [{"values": {"a": "100", "b": "200"}}, {"values": {"a": "1000", "b": "2000"}}]
df = spark.createDataFrame(data)
df.show()

df = df.select("*", F.explode(df['values']).alias('key', 'value'))
df.show()
The output includes one row for each key-value pair in each map, as shown below:
+--------------------+
|              values|
+--------------------+
|[a -> 100, b -> 200]|
|[a -> 1000, b -> ...|
+--------------------+

+--------------------+---+-----+
|              values|key|value|
+--------------------+---+-----+
|[a -> 100, b -> 200]|  a|  100|
|[a -> 100, b -> 200]|  b|  200|
|[a -> 1000, b -> ...|  a| 1000|
|[a -> 1000, b -> ...|  b| 2000|
+--------------------+---+-----+
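If the position of each entry is also needed, the posexplode function returns an extra position column in front of the key and value. Below is a minimal sketch against the same map data (the DataFrame name df_map and the aliases pos, key and value are just illustrative choices):

# Hypothetical re-creation of the map DataFrame used above
data = [{"values": {"a": "100", "b": "200"}},
        {"values": {"a": "1000", "b": "2000"}}]
df_map = spark.createDataFrame(data)

# posexplode yields the entry position together with each key/value pair
df_map.select("*", F.posexplode(df_map['values']).alias('pos', 'key', 'value')).show()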