Use sort() and orderBy() with PySpark DataFrame


Code description

In Spark DataFrame, two APIs are provided to sort the rows of a DataFrame by the specified column or columns: sort and orderBy. orderBy is simply an alias for the sort API.

Syntax

DataFrame.sort(*cols, **kwargs)

For *cols, we can use it to specify a column name, a Column object (pyspark.sql.Column), or a list of column names or Column objects.

For **kwargs, we can use it to specify additional arguments. For PySpark, we can specify a parameter named ascending, which defaults to True. It can also be a list of boolean values, one for each column used to sort the records.
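For illustration, the following minimal sketch shows the accepted forms of *cols and the ascending keyword (assuming a DataFrame df with columns id and col1, as in the snippet below):

from pyspark.sql.functions import col

df.sort('col1')                                  # single column name
df.sort(col('col1'))                             # Column object
df.sort(['col1', 'id'])                          # list of column names
df.sort('col1', 'id', ascending=[True, False])   # one ascending flag per column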

The code snippet below provides examples of sorting a DataFrame.

Sample outputs

+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  8|   E|
|  6|   E|
|  4|   E|
|  2|   E|
|  9|   O|
|  7|   O|
|  5|   O|
|  3|   O|
|  1|   O|
+---+----+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

app_name = "PySpark sort and orderBy Example"
master = "local"

# Create Spark session
builder = SparkSession.builder.appName(app_name) \
    .master(master)
spark = builder.getOrCreate()

# Create a DataFrame with ids 1-9 and a parity column: 'E' for even, 'O' for odd
df = spark.range(1, 10)
df = df.withColumn('col1', expr("case when id%2==0 then 'E' else 'O' end"))

# Sort by col1 ascending (default order)
df.sort('col1').show()
# Sort by col1 then id, both ascending
df.sort(['col1', 'id'], ascending=True).show()
# Sort by col1 ascending and id descending
df.orderBy(['col1', 'id'], ascending=[True, False]).show()
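The same ordering can also be expressed with Column expressions instead of the ascending keyword. Below is a minimal sketch (assuming the same df as above) using the asc and desc methods of pyspark.sql.Column:

from pyspark.sql.functions import col

# Equivalent to df.orderBy(['col1', 'id'], ascending=[True, False])
df.orderBy(col('col1').asc(), col('id').desc()).show()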