PySpark DataFrame - percent_rank() Function

2022-08-18

Code description

In Spark SQL, the PERCENT_RANK window function (see Spark SQL - PERCENT_RANK Window Function) calculates the percentile ranking of each row. This code snippet implements the same percentile ranking (relative ranking) directly with the PySpark DataFrame percent_rank API instead of Spark SQL.

Output:

+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+
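The values in the output follow the usual percent rank definition, (rank - 1) / (number of rows - 1), where tied scores share the lowest rank of their group. The plain-Python sketch below reproduces the column without Spark to make that arithmetic visible. The helper name percent_rank_plain is hypothetical and is not part of PySpark; it only illustrates the formula under the assumption that all rows fall in one partition.

```python
def percent_rank_plain(scores):
    """Return (score, percent_rank) pairs in ascending score order.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    ordered = sorted(scores)
    n = len(ordered)

    # A single row has no other rows to compare against, so its rank is 0.0.
    if n == 1:
        return [(ordered[0], 0.0)]

    # rank = 1-based position of the first occurrence, so ties share a rank.
    first_rank = {}
    for position, value in enumerate(ordered, start=1):
        first_rank.setdefault(value, position)

    # percent rank = (rank - 1) / (rows - 1)
    return [(value, (first_rank[value] - 1) / (n - 1)) for value in ordered]


scores = [56, 78, 70, 93, 95, 95, 75, 81, 66, 73]
for score, pr in percent_rank_plain(scores):
    print(score, pr)
```

The two rows with a score of 95 both get rank 9, so both map to (9 - 1) / (10 - 1), which is why the last two rows of the output share the same value and no row reaches 1.0.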

Code snippet

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import percent_rank

app_name = "PySpark percent_rank Window Function"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [
    [101, 56],
    [102, 78],
    [103, 70],
    [104, 93],
    [105, 95],
    [106, 95],
    [107, 75],
    [108, 81],
    [109, 66],
    [110, 73]]

df = spark.createDataFrame(data, ['Student', 'Score'])

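# Rank all rows by ascending Score. There is no partitionBy, so the whole
# DataFrame is treated as a single partition.
# The explicit frame matches the one ranking functions use by default,
# so rowsBetween() is optional here.
window = Window.orderBy("Score").rowsBetween(
    Window.unboundedPreceding, Window.currentRow)
# percent_rank = (rank - 1) / (rows in partition - 1)
df = df.withColumn('percent_rank', percent_rank().over(window))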

df.show()
