Code description
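Spark SQL provides the PERCENT_RANK window function (see Spark SQL - PERCENT_RANK Window Function). This code snippet computes the same percentile (relative) ranking with the PySpark DataFrame percent_rank API instead of Spark SQL. For each row, percent_rank is (rank - 1) / (number of rows - 1), so the lowest score gets 0.0 and tied scores share the same value.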
Output:
+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+
Code snippet
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import percent_rank

app_name = "PySpark percent_rank Window Function"
master = "local"

spark = (SparkSession.builder
         .appName(app_name)
         .master(master)
         .getOrCreate())

spark.sparkContext.setLogLevel("WARN")

# Each row is [Student, Score]
data = [
    [101, 56],
    [102, 78],
    [103, 70],
    [104, 93],
    [105, 95],
    [106, 95],
    [107, 75],
    [108, 81],
    [109, 66],
    [110, 73]]

df = spark.createDataFrame(data, ['Student', 'Score'])

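# Order all rows by Score. There is no partitionBy, so Spark moves all
# rows into a single partition to evaluate the window.
# The explicit frame matches the one ranking functions use anyway.
window = Window.orderBy("Score").rowsBetween(
    Window.unboundedPreceding, Window.currentRow)

# percent_rank = (rank - 1) / (number of rows - 1)
df = df.withColumn('percent_rank', percent_rank().over(window))

df.show()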
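
The percent_rank values in the output above can be checked without Spark. The following is a minimal sketch in plain Python, assuming the usual definition (rank - 1) / (number of rows - 1) with tied scores sharing the lowest rank. The helper name `percent_rank_plain` is hypothetical and is not part of PySpark.

```python
def percent_rank_plain(scores):
    """Return (rank - 1) / (n - 1) for each score, ranking in ascending order.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    n = len(scores)
    if n <= 1:
        # A single row has nothing to be ranked against, so its value is 0.0
        return [0.0] * n
    ordered = sorted(scores)
    # The index of the first occurrence in the sorted list equals the number
    # of strictly smaller scores, which is rank - 1 (ties share that rank).
    return [ordered.index(s) / (n - 1) for s in scores]


# Same scores as the DataFrame above, in the same row order
scores = [56, 78, 70, 93, 95, 95, 75, 81, 66, 73]

for score, pr in sorted(zip(scores, percent_rank_plain(scores))):
    print(score, pr)
```

The two scores of 95 both have eight strictly smaller scores, so both get 8 / 9, which is why the last two rows of the output share the same value.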