Spark "ROW_ID"
Spark does not implement a ROW_ID pseudo-column. To add a unique sequential number to each record in a DataFrame, we can use the ROW_NUMBER window function.
Use ROW_NUMBER function
The following code snippet uses the ROW_NUMBER function to add a unique sequential number to each row of the DataFrame.
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

appName = "PySpark Example - ROW_ID alternatives"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data
data = [(1, 'A'), (1, 'A'), (3, 'B')]

# Schema
schema = StructType([StructField("ID", IntegerType(), True),
                     StructField("Value", StringType(), True)])

# Create Spark DataFrame
df = spark.createDataFrame(data, schema)
df.show()

# Add a sequential row_id column using a window function ordered by ID
df = df.withColumn('row_id', row_number().over(Window.orderBy("ID")))
df.show()

spark.stop()
Sample output:
+---+-----+
| ID|Value|
+---+-----+
|  1|    A|
|  1|    A|
|  3|    B|
+---+-----+

21/05/16 11:00:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+---+-----+------+
| ID|Value|row_id|
+---+-----+------+
|  1|    A|     1|
|  1|    A|     2|
|  3|    B|     3|
+---+-----+------+
As the warning message indicates, a window function without a PARTITION BY clause moves all data to a single partition, which can cause serious performance degradation on large datasets. Take this into consideration before using this approach.
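If row numbers only need to be unique rather than strictly consecutive, one way to avoid the single-partition shuffle is Spark's built-in monotonically_increasing_id function, which assigns each row a unique 64-bit ID without moving data between partitions. The sketch below (app name is illustrative) shows this alternative on the same sample data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder \
    .appName("PySpark Example - monotonically_increasing_id") \
    .master("local") \
    .getOrCreate()

df = spark.createDataFrame([(1, 'A'), (1, 'A'), (3, 'B')], ["ID", "Value"])

# IDs are guaranteed unique and increasing within each partition,
# but NOT consecutive (gaps appear across partitions).
df = df.withColumn("row_id", monotonically_increasing_id())
df.show()

row_ids = [row["row_id"] for row in df.collect()]

spark.stop()

Alternatively, if consecutive numbering within groups is acceptable, adding a partitionBy clause to the Window (e.g. Window.partitionBy("ID").orderBy("ID")) also avoids collapsing everything into one partition, at the cost of the numbering restarting per group.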