Use sort() and orderBy() with PySpark DataFrame

Kontext | 9/3/2022

Code description

A Spark DataFrame provides two APIs to sort its rows based on one or more columns: sort and orderBy. orderBy is simply an alias for the sort API.

Syntax

    DataFrame.sort(*cols, **kwargs)  
    

For *cols, we can pass a column name, a Column object (pyspark.sql.Column), or a list of column names or Column objects.

For **kwargs, we can pass additional arguments. In PySpark, the supported keyword argument is ascending, which defaults to True. It can also be a list of boolean values, one per sort column, controlling the sort direction of each column individually.

The code snippet below provides examples of sorting a DataFrame; the sample outputs correspond to the three sort calls in order.

Sample outputs

    +---+----+
    | id|col1|
    +---+----+
    |  2|   E|
    |  4|   E|
    |  6|   E|
    |  8|   E|
    |  1|   O|
    |  3|   O|
    |  5|   O|
    |  7|   O|
    |  9|   O|
    +---+----+
    
    +---+----+
    | id|col1|
    +---+----+
    |  2|   E|
    |  4|   E|
    |  6|   E|
    |  8|   E|
    |  1|   O|
    |  3|   O|
    |  5|   O|
    |  7|   O|
    |  9|   O|
    +---+----+
    
    +---+----+
    | id|col1|
    +---+----+
    |  8|   E|
    |  6|   E|
    |  4|   E|
    |  2|   E|
    |  9|   O|
    |  7|   O|
    |  5|   O|
    |  3|   O|
    |  1|   O|
    +---+----+  
    

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr
    
    app_name = "PySpark sort and orderBy Example"
    master = "local"
    
    # Create Spark session
    builder = SparkSession.builder.appName(app_name).master(master)
    spark = builder.getOrCreate()
    
    df = spark.range(1,10)
    df = df.withColumn('col1', expr("case when id%2==0 then 'E' else 'O' end"))
    
    # Sort
    df.sort('col1').show()
    df.sort(['col1','id'], ascending=True).show()
    df.orderBy(['col1','id'], ascending=[True,False]).show()
