PySpark DataFrame - inner, left and right Joins

event 2022-08-21 visibility 218 comment 0 insights
more_vert
insights Stats
PySpark DataFrame - inner, left and right Joins
Kontext Kontext Code Snippets & Tips

Code snippets and tips for various programming languages/frameworks. All code examples are under MIT or Apache 2.0 license unless specified otherwise. 

Code description

This code snippet shows you how to perform inner, left and right joins with DataFrame.join API. 

def join(self, other, on=None, how=None)

Supported join types

The default join type is inner. The supported values for parameter how are: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

To learn about the these different join types, refer to article Spark SQL Joins with Examples.

Join via multiple columns

If there are more than one column to join, we can specify on parameter as a list of column name:

df1.join(df2, on=['id','other_column'], how='left')

Output from the code snippet:

+---+----+
| id|attr|
+---+----+
|  1|   A|
|  2|   B|
+---+----+

+---+--------+
| id|attr_int|
+---+--------+
|  1|     100|
|  2|     200|
|  3|     300|
+---+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
+---+----+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
+---+----+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
|  3|null|     300|
+---+----+--------+

Code snippet

from pyspark.sql import SparkSession
from pyspark import SparkConf

app_name = "PySpark - Joins Example"
master = "local[8]"

conf = SparkConf().setAppName(app_name)\
    .setMaster(master)

spark = SparkSession.builder.config(conf=conf) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

df1 = spark.createDataFrame([[1, 'A'], [2, 'B']], ['id', 'attr'])
df1.show()
df2 = spark.createDataFrame([[1, 100], [2, 200], [3, 300]], ['id', 'attr_int'])
df2.show()

# Joins 
df = df1.join(df2, on='id', how='inner')
df.show()

df = df1.join(df2, on='id', how='left')
df.show()

df = df1.join(df2, on='id', how='right')
df.show()
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts