Spark repartition vs. coalesce
Spark provides two functions to repartition data: repartition and coalesce. The two functions are designed for different use cases.
About coalesce
As the word suggests, the coalesce function is used to merge things together, that is, to come together and form a group or a single unit. The syntax is as follows:
DataFrame.coalesce(numPartitions)
Parameter numPartitions specifies the new number of partitions. The function returns a DataFrame with exactly numPartitions partitions. It is commonly used to reduce the number of partitions of a dataframe; note that coalesce can only decrease the partition count, so requesting more partitions than the dataframe currently has leaves it unchanged.
The following example reduces the number of partitions of the dataframe to 100.
df = df.coalesce(100)
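For instance, here is a minimal sketch (assuming a local SparkSession; the partition counts are illustrative) that verifies the partition count before and after calling coalesce:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("coalesce-demo").getOrCreate()

# Start from a dataframe with 200 partitions.
df = spark.range(0, 1000000).repartition(200)
print(df.rdd.getNumPartitions())  # 200

df = df.coalesce(100)
print(df.rdd.getNumPartitions())  # 100

# Requesting more partitions than currently exist is a no-op for coalesce.
print(df.coalesce(400).rdd.getNumPartitions())  # still 100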
About repartition
The repartition function can also be used to repartition a dataframe and offers more options. The following is the function syntax:
DataFrame.repartition(numPartitions, *cols)
The following example repartitions the dataframe to 100 partitions.
df = df.repartition(100)
When only a partition number is specified, Spark distributes the rows round-robin to produce evenly sized partitions; hash partitioning is used when partitioning columns are specified, as in the examples below.
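A quick sketch (again assuming a local SparkSession) showing that, unlike coalesce, repartition can both increase and decrease the partition count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("repartition-demo").getOrCreate()

df = spark.range(0, 1000000)
print(df.rdd.getNumPartitions())  # defaults to the local parallelism, e.g. 4

print(df.repartition(100).rdd.getNumPartitions())  # increased to 100
print(df.repartition(10).rdd.getNumPartitions())   # decreased to 10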
The following example repartitions the dataframe by multiple columns (Year, Month and Day):
df = df.repartition("Year", "Month", "Day")
Another example specifies both parameters:
df = df.repartition(100, "Name")
The above example repartitions the dataframe into 100 partitions, with column "Name" used as the hash partitioning key.
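As a sanity check, the sketch below (using a small hypothetical dataset with a Name column) adds spark_partition_id to confirm that all rows sharing the same key land in the same partition:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[4]").appName("hash-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice",), ("Bob",), ("Alice",), ("Carol",), ("Bob",)], ["Name"])

# Each distinct Name maps to exactly one partition id.
df.repartition(100, "Name") \
    .withColumn("pid", spark_partition_id()) \
    .distinct() \
    .show()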
About repartitionByRange
Since Spark 2.4.0, a new function named repartitionByRange is available. The syntax is as follows:
DataFrame.repartitionByRange(numPartitions, *cols)
This function is very similar to the repartition function; however, the returned dataframe is range partitioned instead of hash partitioned.
For example, the following code snippet repartitions the dataframe into 100 ranges based on the column Population:
df = df.repartitionByRange(100, "Population")
This function uses Spark's range partitioner, which partitions sortable records by range into roughly equal ranges. Data sampling is used to create the ranges, which can produce inconsistent results across different runs. Refer to the RangePartitioner Javadoc (Spark 2.3.0) for more details about this range partitioner.
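To see the effect, here is a minimal sketch (assuming a local SparkSession, with a numeric Population column built from a range purely for illustration) that prints the value range held by each partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("range-demo").getOrCreate()

df = spark.range(0, 1000).withColumnRenamed("id", "Population")
ranged = df.repartitionByRange(4, "Population")

# Each partition holds a contiguous, non-overlapping range of values.
for i, rows in enumerate(ranged.rdd.glom().collect()):
    values = [row["Population"] for row in rows]
    if values:
        print(f"partition {i}: min={min(values)}, max={max(values)}")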
Key differences
- When the coalesce function is used, a full data shuffle doesn't happen because it creates a narrow dependency: each current partition is remapped to one of the new partitions when an action occurs (see the plan comparison sketch after this list).
- The repartition function can also be used to change the partition number of a dataframe. It can increase or decrease the number of partitions, which involves a data shuffle. It can also partition data by the specified dataframe column(s).
- The repartitionByRange function repartitions using the range partitioner to create partitions that are roughly equal in size.
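To observe the narrow-versus-shuffle distinction directly, you can compare the physical plans (a minimal sketch assuming a local SparkSession; the exact plan text varies by Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("plan-demo").getOrCreate()
df = spark.range(0, 1000).repartition(20)

df.coalesce(5).explain()     # plan contains a Coalesce node: no shuffle
df.repartition(5).explain()  # plan contains an Exchange (shuffle) node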
If the goal is simply to reduce the number of partitions without partitioning by dataframe column(s), I recommend using the coalesce function for potentially better performance.