Code description
In Spark DataFrame, two APIs are provided to sort the rows of a DataFrame based on the provided column or columns: sort and orderBy. orderBy is just an alias for the sort API.
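For example, the two calls below are interchangeable (a minimal sketch, assuming df is any DataFrame with an id column):
df.sort("id").show()
df.orderBy("id").show()  # same result: orderBy is an alias for sort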
Syntax
DataFrame.sort(*cols, **kwargs)
For *cols, we can use it to specify a column name, a Column object (pyspark.sql.Column), or a list of column names or Column objects.
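For illustration, all of the following calls are valid (a sketch, assuming df has columns id and col1):
df.sort("col1")           # a single column name
df.sort(df.col1.desc())   # a Column object carrying a sort expression
df.sort(["col1", "id"])   # a list of column names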
For **kwargs, we can use it to specify additional arguments. In PySpark, we can pass a parameter named ascending, which defaults to True. It can also be a list of boolean values, one for each of the columns used to sort the records.
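For instance, ascending can be passed as a single boolean or as one boolean per sort column (a sketch, assuming df has columns id and col1):
df.sort("id", ascending=False)                    # id descending
df.sort(["col1", "id"], ascending=[True, False])  # col1 ascending, id descending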
The code snippet below provides examples of sorting a DataFrame.
Sample outputs
The three tables below are produced, in order, by the three show() calls in the code snippet.
+---+----+
| id|col1|
+---+----+
| 2| E|
| 4| E|
| 6| E|
| 8| E|
| 1| O|
| 3| O|
| 5| O|
| 7| O|
| 9| O|
+---+----+
+---+----+
| id|col1|
+---+----+
| 2| E|
| 4| E|
| 6| E|
| 8| E|
| 1| O|
| 3| O|
| 5| O|
| 7| O|
| 9| O|
+---+----+
+---+----+
| id|col1|
+---+----+
| 8| E|
| 6| E|
| 4| E|
| 2| E|
| 9| O|
| 7| O|
| 5| O|
| 3| O|
| 1| O|
+---+----+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

app_name = "PySpark sort and orderBy Example"
master = "local"

# Create Spark session
spark = SparkSession.builder.appName(app_name).master(master).getOrCreate()

# Create a DataFrame with ids 1 to 9 and a column marking each id as even ('E') or odd ('O')
df = spark.range(1, 10)
df = df.withColumn('col1', expr("case when id%2==0 then 'E' else 'O' end"))

# Sort by a single column (ascending by default)
df.sort('col1').show()
# Sort by multiple columns, all ascending
df.sort(['col1', 'id'], ascending=True).show()
# Sort col1 ascending and id descending
df.orderBy(['col1', 'id'], ascending=[True, False]).show()
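As an alternative, the last example can also be written with Column sort expressions instead of the ascending argument; a minimal sketch using col from pyspark.sql.functions:
from pyspark.sql.functions import col
df.orderBy(col('col1').asc(), col('id').desc()).show()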