Spark "ROW_ID"
Spark has no built-in ROW_ID pseudo-column. To add a unique sequential number to each record in a DataFrame, we can use the ROW_NUMBER window function.
Use ROW_NUMBER function
The following code snippet uses the ROW_NUMBER function to add a unique sequential number to each row of a DataFrame.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

appName = "PySpark Example - ROW_ID alternatives"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'B')]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame from the sample data
df = spark.createDataFrame(data, schema)
df.show()

# Add a sequential row number ordered by the ID column
df = df.withColumn('row_id', row_number().over(Window.orderBy("ID")))
df.show()

spark.stop()
Sample output:
+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    B|
+---+-----+

21/05/16 11:00:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+---+-----+------+
| ID|Value|row_id|
+---+-----+------+
|  1|    A|     1|
|  1|    A|     2|
|  3|    B|     3|
+---+-----+------+
As the warning message indicates, a window without a partition clause forces Spark to move all rows to a single partition, which can cause serious performance degradation on large datasets. Please take this into consideration when using this function.