Spark "ROW_ID"

Raymond Tang · 5/16/2021

In Spark, there is no ROW_ID function implemented. To add a unique sequential number to each record in a DataFrame, we can use the ROW_NUMBER window function instead.

Use ROW_NUMBER function

The following code snippet uses the ROW_NUMBER function to add a unique sequential number to each row of the DataFrame.

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

appName = "PySpark Example - ROW_ID alternatives"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'B')]

# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the sample data and schema
df = spark.createDataFrame(data, schema)
df.show()

df = df.withColumn('row_id', row_number().over(Window.orderBy("ID")))
df.show()

spark.stop()

Sample output:

+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    B|
+---+-----+

21/05/16 11:00:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-----+------+
| ID|Value|row_id|
+---+-----+------+
|  1|    A|     1|
|  1|    A|     2|
|  3|    B|     3|
+---+-----+------+

As the warning message indicates, defining a window without a partition moves all data into a single partition, which can cause serious performance degradation. Please take this into consideration when using this function on large DataFrames.
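If a strictly consecutive sequence is not required and a unique identifier per row is enough, the built-in monotonically_increasing_id function avoids moving all data to a single partition. The snippet below is a minimal sketch of that alternative; note that the generated IDs are increasing and unique, but not consecutive (gaps appear across partitions).

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder \
    .appName("PySpark Example - monotonically_increasing_id") \
    .master("local") \
    .getOrCreate()

df = spark.createDataFrame([(1, 'A'), (1, 'A'), (3, 'B')], ["ID", "Value"])

# Assign a unique 64-bit integer to each row without shuffling all data
# into one partition; values are increasing but not guaranteed consecutive.
df = df.withColumn('row_id', monotonically_increasing_id())
df.show()

spark.stop()

Choose ROW_NUMBER when a gap-free sequence matters; choose monotonically_increasing_id when uniqueness alone is sufficient and performance on large data is a concern.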

