Spark has no built-in ROW_ID function. To add a unique sequential number to each record in a DataFrame, we can use the ROW_NUMBER window function.
Use ROW_NUMBER function
The following code snippet uses the row_number function to add a unique sequential number to each row of the DataFrame.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
appName = "PySpark Example - ROW_ID alternatives"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'B')]
# schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])
# Create Spark DataFrame from the sample data
df = spark.createDataFrame(data, schema)
df.show()
df = df.withColumn('row_id', row_number().over(Window.orderBy("ID")))
df.show()
spark.stop()
Sample output:
+---+-----+
| ID|Value|
+---+-----+
| 1| A|
| 1| A|
| 3| B|
+---+-----+
21/05/16 11:00:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-----+------+
| ID|Value|row_id|
+---+-----+------+
| 1| A| 1|
| 1| A| 2|
| 3| B| 3|
+---+-----+------+
As the warning message indicates, a window function without a PARTITION BY clause moves all data into a single partition, which can cause serious performance degradation on large DataFrames. Please take this into consideration when using this function.