Spark Hash Functions Introduction - MD5 and SHA
Spark provides several hash functions, including md5, sha1, and sha2 (covering SHA-224, SHA-256, SHA-384, and SHA-512). These functions can be used in Spark SQL or in DataFrame transformations with PySpark, Scala, and other languages. This article provides a brief summary of these commonly used functions.
A typical use of these functions is to calculate a row checksum to simplify value comparisons. Be aware that hash collisions can occur, especially with very large volumes of data.
Function md5
MD5 is a commonly used message-digest algorithm. In Spark, this function returns a hex string of the 128-bit MD5 checksum of the input expression.
md5(expr)
Use in Spark SQL
The following code snippet provides an example of using this function in Spark SQL.
spark-sql> select md5('ABC');
902fbdd2b1df0c4f70b4a5d23525e932
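As a sanity check, the same digest can be reproduced outside Spark with Python's standard hashlib module (plain Python, independent of Spark):

```python
import hashlib

# MD5 of the UTF-8 bytes of 'ABC'; matches Spark's md5('ABC') output.
digest = hashlib.md5('ABC'.encode('utf-8')).hexdigest()
print(digest)  # 902fbdd2b1df0c4f70b4a5d23525e932
```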
Use it in PySpark
We can also use it directly in PySpark DataFrame transformations. The following code snippet creates a sample DataFrame and then derives a new column using the md5 function, which needs to be imported before it can be used.
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Python version 3.8.10 (default, Mar 15 2022 12:22:08)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1655380700352).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import md5
>>> df = spark.createDataFrame([{'col1':'ABC'}])
>>> df.show()
+----+
|col1|
+----+
| ABC|
+----+
>>> df = df.withColumn('md5', md5(df['col1']))
>>> df.show(truncate=False)
+----+--------------------------------+
|col1|md5                             |
+----+--------------------------------+
|ABC |902fbdd2b1df0c4f70b4a5d23525e932|
+----+--------------------------------+
The value is the same as the Spark SQL shell output.
Standalone scripts can be submitted with the spark-submit command. To start a PySpark shell, you can directly run the pyspark command.
Function sha and sha1
Functions sha and sha1 are identical: both return a hex string representing the SHA-1 hash value of the input expression.
sha(expr)
sha1(expr)
Use in Spark SQL
The following code snippet uses the sha and sha1 functions in Spark SQL:
spark-sql> select sha('ABC'), sha1('ABC');
3c01bdbb26f358bab27f267924aa2c9a03fcfdb8	3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
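The SHA-1 digest can likewise be cross-checked with Python's hashlib:

```python
import hashlib

# SHA-1 of the UTF-8 bytes of 'ABC'; matches Spark's sha('ABC') and sha1('ABC').
sha1_hex = hashlib.sha1('ABC'.encode('utf-8')).hexdigest()
print(sha1_hex)  # 3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
```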
Use it in PySpark
Similar to the md5 example, we can also import sha1 and use it in PySpark DataFrame transformations directly.
>>> from pyspark.sql.functions import sha1
>>> df = df.withColumn('sha1', sha1(df['col1']))
>>> df.show(truncate=False)
+----+--------------------------------+----------------------------------------+
|col1|md5                             |sha1                                    |
+----+--------------------------------+----------------------------------------+
|ABC |902fbdd2b1df0c4f70b4a5d23525e932|3c01bdbb26f358bab27f267924aa2c9a03fcfdb8|
+----+--------------------------------+----------------------------------------+
Function sha2
The sha2 function calculates the checksum of the input expression using the SHA-2 family of algorithms. The syntax is defined as follows:
sha2(expr, bitLength)
For the argument bitLength, Spark supports 224 (SHA-224), 256 (SHA-256), 384 (SHA-384), and 512 (SHA-512). A bitLength of 0 is also accepted and is equivalent to 256.
The following are some examples using them in Spark SQL and Spark DataFrame.
Use sha2 in Spark SQL
spark-sql> select sha2('ABC', 224), sha2('ABC', 256);
107c5072b799c4771f328304cfe1ebb375eb6ea7f35a3aa753836fad	b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
In Spark SQL, we can directly use these built-in scalar functions.
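These SHA-2 digests can be cross-checked outside Spark with Python's hashlib (a minimal sketch, independent of Spark):

```python
import hashlib

# SHA-2 digests of the UTF-8 bytes of 'ABC'; these should match
# Spark's sha2('ABC', 224) and sha2('ABC', 256) outputs.
msg = 'ABC'.encode('utf-8')
sha224_hex = hashlib.sha224(msg).hexdigest()
sha256_hex = hashlib.sha256(msg).hexdigest()
print(sha224_hex)
print(sha256_hex)
```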
Use sha2 in PySpark
Again, we will import this function and then use it with the previous PySpark DataFrame.
>>> from pyspark.sql.functions import sha2
>>> df = df.withColumn('sha-384', sha2(df['col1'], 384))
>>> df = df.withColumn('sha-512', sha2(df['col1'], 512))
>>> df.select('sha-384').show(truncate=False)
+------------------------------------------------------------------------------------------------+
|sha-384                                                                                         |
+------------------------------------------------------------------------------------------------+
|1e02dc92a41db610c9bcdc9b5935d1fb9be5639116f6c67e97bc1a3ac649753baba7ba021c813e1fe20c0480213ad371|
+------------------------------------------------------------------------------------------------+
>>> df.select('sha-512').show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------+
|sha-512                                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------+
|397118fdac8d83ad98813c50759c85b8c47565d8268bf10da483153b747a74743a58a90e85aa9f705ce6984ffc128db567489817e4092d050d8a1cc596ddc119|
+--------------------------------------------------------------------------------------------------------------------------------+
As you can see from the output above, the length of the hashed value increases as you increase the bitLength argument.
Calculate row checksum
We can use these functions to calculate the checksum of Spark DataFrame rows.
>>> data = []
>>> data.append({'col1': 'ABC', 'col2': 2})
>>> data.append({'col1': 'DEF', 'col2': 4})
>>> df2 = spark.createDataFrame(data)
>>> df2.show()
+----+----+
|col1|col2|
+----+----+
| ABC|   2|
| DEF|   4|
+----+----+
>>> from pyspark.sql.functions import concat_ws
>>> df2 = df2.withColumn('row_checksum', md5(concat_ws('|', df2.col1, df2.col2)))
Note that the first argument of concat_ws is the separator; the columns to concatenate follow it.
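The separator matters: without one, different rows can concatenate to the same string and therefore produce the same checksum. A quick illustration in plain Python with hashlib (the row_md5 helper is hypothetical, used here only to mimic what md5(concat_ws(...)) computes):

```python
import hashlib

def row_md5(*values, sep='|'):
    # Join column values with a separator before hashing, as concat_ws does in Spark.
    return hashlib.md5(sep.join(str(v) for v in values).encode('utf-8')).hexdigest()

# Without a separator, rows ('AB', 'C') and ('A', 'BC') both become 'ABC' and collide:
no_sep_a = row_md5('AB', 'C', sep='')
no_sep_b = row_md5('A', 'BC', sep='')
print(no_sep_a == no_sep_b)  # True - a false match

# With a separator, the two rows hash differently:
with_sep_a = row_md5('AB', 'C')
with_sep_b = row_md5('A', 'BC')
print(with_sep_a == with_sep_b)  # False
```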
Calculate hash value with salt
For these hash functions, the same input always produces the same output. This makes it possible to recover original values by comparing hashed values against a precomputed dictionary, which is usually called a dictionary attack. To defend against dictionary attacks, we can prepend or append a salt to the input string.
>>> from pyspark.sql.functions import concat, lit
>>> df2 = df2.withColumn('sha_salt', sha2(concat(lit('salt'), df2.col1), 256))
Note that the salt must be added as a literal value with lit. Passing it as the first argument of concat_ws would make it the separator, which is never applied when only one column is concatenated, so no salt would actually be added.
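The effect of the salt is easy to check in plain Python with hashlib (a sketch; the salt value 'salt' is just a placeholder, and sha256_salted is a hypothetical helper mirroring sha2(concat(lit(salt), col), 256)):

```python
import hashlib

def sha256_salted(value, salt=''):
    # Prepend the salt to the input before hashing, mirroring concat(lit(salt), col).
    return hashlib.sha256((salt + value).encode('utf-8')).hexdigest()

unsalted = sha256_salted('ABC')
salted = sha256_salted('ABC', salt='salt')
print(unsalted)  # b5d4045c... - the same as Spark's sha2('ABC', 256)
print(salted)    # a different digest; a dictionary built for unsalted values won't match
```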