PySpark DataFrame - union, unionAll and unionByName
insights Stats
Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise.
Code description
PySpark DataFrame provides three methods to union data together: union
, unionAll
and unionByName
. The first two are like Spark SQL UNION ALL clause which doesn't remove duplicates. unionAll
is the alias for union
. We can use distinct
method to deduplicate. The third function will use column names to resolve columns instead of positions. unionByName can also be used to merge two DataFrames with different schemas.
Syntax of unionByName
DataFrame.unionByName(other, allowMissingColumns=False)
If allowMissingColumns
is specified as True
, the missing columns in both DataFrame will be added with default value null
. This parameter is only available from Spark 3.1.0. For lower versions, the follow error may appear:
df1.unionByName(df4, allowMissingColumns=True).show(truncate=False)
TypeError: unionByName() got an unexpected keyword argument 'allowMissingColumns'
Outputs
The following are the outputs from the code snippet.
+---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |2 |B |200| +---+---+---+ +---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |1 |A |100| |2 |B |200| +---+---+---+ +---+---+---+ |c1 |c2 |c3 | +---+---+---+ |1 |A |100| |2 |B |200| +---+---+---+ +---+---+----+----+ |c1 |c2 |c3 |c4 | +---+---+----+----+ |1 |A |100 |null| |3 |C |null|ABC | +---+---+----+----+
Code snippet
from pyspark.sql import SparkSession appName = "PySpark union Examples" master = "local" spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .getOrCreate() spark.sparkContext.setLogLevel("ERROR") # Create DataFrames df1 = spark.createDataFrame([[1, 'A', 100]], ['c1', 'c2', 'c3']) df2 = spark.createDataFrame([[1, 'A', 100], [2, 'B', 200]], ['c1', 'c2', 'c3']) df3 = spark.createDataFrame([[2, 200, 'B']], ['c1', 'c3', 'c2']) df4 = spark.createDataFrame([[3, 'C', 'ABC']], ['c1', 'c2', 'c4']) # Use union method df1.union(df2).distinct().show(truncate=False) # Use unionAll method df1.unionAll(df2).show(truncate=False) # Use unionByName method df1.unionByName(df3).show(truncate=False) df1.unionByName(df4, allowMissingColumns=True).show(truncate=False)