
Spark Hash Functions Introduction - MD5 and SHA

2022-06-16

Spark provides several hash functions, such as md5, sha1, and sha2 (including SHA-224, SHA-256, SHA-384, and SHA-512). These functions can be used in Spark SQL or in DataFrame transformations using PySpark, Scala, etc. This article provides a short summary of these commonly used functions.

A typical usage of these functions is to calculate a row checksum to simplify value comparisons. Be aware that hash collisions can occur, especially with very large volumes of data.

Function md5

MD5 is a commonly used message digest algorithm. In Spark, this function returns a hex string of the MD5 128-bit checksum of the input expression.

md5(expr)

Use in Spark SQL

The following code snippet provides an example of using this function in Spark SQL. 

spark-sql> select md5('ABC');
902fbdd2b1df0c4f70b4a5d23525e932
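
Like most Spark SQL functions, md5 propagates NULL inputs, which is worth remembering when hashing nullable columns:

spark-sql> select md5(NULL);
NULL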

Use it in PySpark

We can also use it directly in PySpark DataFrame transformations. The following code snippet creates a sample DataFrame and then derives a new column using the md5 function. The function needs to be imported before it can be used.

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Python version 3.8.10 (default, Mar 15 2022 12:22:08)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1655380700352).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import md5
>>> df = spark.createDataFrame([{'col1':'ABC'}])
>>> df.show()
+----+
|col1|
+----+
| ABC|
+----+
>>> df = df.withColumn('md5',md5(df['col1']))
>>> df.show(truncate=False)
+----+--------------------------------+
|col1|md5                             |
+----+--------------------------------+
|ABC |902fbdd2b1df0c4f70b4a5d23525e932|
+----+--------------------------------+

The value is the same as the Spark SQL shell output. 

Info: I'm using the PySpark interactive shell in the above example for simplicity. To start the PySpark shell, run the command pyspark directly. Alternatively, you can create a Python script file and run it with the spark-submit command.
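
For reference, here is a minimal sketch of such a script (the file name md5_example.py is just an example); unlike the interactive shell, it has to build its own SparkSession instead of relying on the shell-provided spark object:

# md5_example.py - run with: spark-submit md5_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import md5

spark = SparkSession.builder.appName('md5-example').getOrCreate()

# Same sample data as the shell session above
df = spark.createDataFrame([{'col1': 'ABC'}])
df = df.withColumn('md5', md5(df['col1']))
df.show(truncate=False)

spark.stop()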

Function sha and sha1

Functions sha and sha1 are aliases of each other; both return a hex string representing the SHA-1 hash value of the input expression.

sha(expr)
sha1(expr)

Use in Spark SQL

The following code snippet uses sha and sha1 functions in Spark SQL:

spark-sql> select sha('ABC'), sha1('ABC');
3c01bdbb26f358bab27f267924aa2c9a03fcfdb8        3c01bdbb26f358bab27f267924aa2c9a03fcfdb8

Use it in PySpark

Similar to the md5 example, we can also import sha1 and use it directly in PySpark DataFrame transformations.

>>> from pyspark.sql.functions import sha1
>>> df = df.withColumn('sha1', sha1(df['col1']))
>>> df.show(truncate=False)
+----+--------------------------------+----------------------------------------+
|col1|md5                             |sha1                                    |
+----+--------------------------------+----------------------------------------+
|ABC |902fbdd2b1df0c4f70b4a5d23525e932|3c01bdbb26f358bab27f267924aa2c9a03fcfdb8|
+----+--------------------------------+----------------------------------------+

Function sha2

The sha2 function calculates the checksum of the input expression using the SHA-2 family of algorithms. The syntax is as follows:

sha2(expr, bitLength) 

For the bitLength argument, Spark supports 224 (SHA-224), 256 (SHA-256), 384 (SHA-384), and 512 (SHA-512). A bitLength of 0 is also accepted and is equivalent to 256.

The following are some examples using them in Spark SQL and Spark DataFrame.

Use sha2 in Spark SQL

spark-sql> select sha2('ABC', 224), sha2('ABC', 256);
107c5072b799c4771f328304cfe1ebb375eb6ea7f35a3aa753836fad        b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78

In Spark SQL, we can directly use these built-in scalar functions. 

Use sha2 in PySpark

Again, we will import this function and then use it with the previous PySpark DataFrame.

>>> from pyspark.sql.functions import sha2
>>> df = df.withColumn('sha-384', sha2(df['col1'],384))
>>> df = df.withColumn('sha-512', sha2(df['col1'],512))
>>> df.select('sha-384').show(truncate=False)
+------------------------------------------------------------------------------------------------+
|sha-384                                                                                         |
+------------------------------------------------------------------------------------------------+
|1e02dc92a41db610c9bcdc9b5935d1fb9be5639116f6c67e97bc1a3ac649753baba7ba021c813e1fe20c0480213ad371|
+------------------------------------------------------------------------------------------------+
>>> df.select('sha-512').show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------+
|sha-512                                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------+
|397118fdac8d83ad98813c50759c85b8c47565d8268bf10da483153b747a74743a58a90e85aa9f705ce6984ffc128db567489817e4092d050d8a1cc596ddc119|
+--------------------------------------------------------------------------------------------------------------------------------+

As you can see from the above output, the length of the hashed value increases as the bitLength argument increases: 224 bits produce 56 hex characters, 256 bits produce 64, 384 bits produce 96, and 512 bits produce 128.

Calculate row checksum

We can use these functions to calculate the checksum of Spark DataFrame rows. 

>>> data=[]
>>> data.append({'col1':'ABC','col2':2})
>>> data.append({'col1':'DEF','col2':4})
>>> df2 = spark.createDataFrame(data)
>>> df2.show()
+----+----+
|col1|col2|
+----+----+
| ABC|   2|
| DEF|   4|
+----+----+
>>> from pyspark.sql.functions import concat_ws
>>> df2 = df2.withColumn('row_checksum', md5(concat_ws('col1','col2')) )
>>> df2.show(truncate=False)
+----+----+--------------------------------+
|col1|col2|row_checksum                    |
+----+----+--------------------------------+
|ABC |2   |c81e728d9d4c2f636f067f89cc14862c|
|DEF |4   |a87ff679a2f3e71d9181a67b7542122c|
+----+----+--------------------------------+
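
Note that the first argument of concat_ws is the separator, not a column. In the snippet above, 'col1' is therefore treated as the separator and only column col2 is hashed, which is why the checksums shown match md5('2') and md5('4'). To checksum the whole row, pass a separator first and then all of the columns:

>>> df2 = df2.withColumn('row_checksum', md5(concat_ws('|', df2.col1, df2.col2)))

Using an explicit separator such as '|' also avoids ambiguous concatenations, where rows like ('AB','C2') and ('ABC','2') would otherwise produce the same checksum.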

Calculate hash value with salt

For these hash functions, the same input always produces the same output. This makes it possible to recover the original values by comparing hashed values against a precomputed dictionary, which is usually called a dictionary attack. To defend against dictionary attacks, we can prepend or append a salt to the input string.

>>> df2 = df2.withColumn('sha_salt', sha2(concat_ws('salt', df2.col1),256))
>>> df2.show(truncate=False)
+----+----+--------------------------------+----------------------------------------------------------------+
|col1|col2|row_checksum                    |sha_salt                                                        |
+----+----+--------------------------------+----------------------------------------------------------------+
|ABC |2   |c81e728d9d4c2f636f067f89cc14862c|b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78|
|DEF |4   |a87ff679a2f3e71d9181a67b7542122c|967c5a5b7e2fbbe3080a0c5cefea7c279570b16ae8465525538bc3b115267a45|
+----+----+--------------------------------+----------------------------------------------------------------+
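
The same concat_ws pitfall applies here: 'salt' is passed as the separator, and because only one column follows it, the separator is never inserted. That is why the sha_salt value for ABC above is identical to the unsalted sha2('ABC', 256) output shown earlier. To actually prepend a salt, concatenate it as a literal instead, for example:

>>> from pyspark.sql.functions import concat, lit
>>> df2 = df2.withColumn('sha_salt', sha2(concat(lit('salt'), df2.col1), 256))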