Spark SQL - Array Functions

Unlike traditional RDBMS systems, Spark SQL supports complex types like array or map. There are a number of built-in functions to operate efficiently on array values.

array, array\_repeat and sequence

ArrayType columns can be created directly using array or array_repeatfunction. The latter repeat one element multiple times based on the input parameter. Similarly as many data frameworks, sequencefunction is also available to construct an array, which generates an array of elements from start to stop (inclusive), incrementing by step. Supported types include byte, short, integer, long, date and timestamp. The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the 'date' or 'timestamp' type then the step expression must resolve to the 'interval' type, otherwise to the same type as the start and stop expressions.

Example:

spark-sql> select array(1,2,3,4,5);
array(1, 2, 3, 4, 5)
[1,2,3,4,5]

spark-sql> select array_repeat(1,10);
array_repeat(1, 10)
[1,1,1,1,1,1,1,1,1,1]

spark-sql> select sequence(1, 100, 4);
sequence(1, 100, 4)
[1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97]

spark-sql> select sequence(DATE'2021-01-01', DATE'2021-12-31', interval 1 month);
sequence(DATE '2021-01-01', DATE '2021-12-31', INTERVAL '1 months')
[2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01,2021-10-01,2021-11-01,2021-12-01]

shuffle

This function returns a random permutation of the input array.

Example:

spark-sql> select shuffle(array(1,2,3,4,5,2));
shuffle(array(1, 2, 3, 4, 5, 2))
[3,5,2,1,4,2]

sort\_array

Use this function to sort an array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. The second parameter is boolean:

true - ascending order (default value)
false - descending order

Example:

spark-sql> select sort_array(array(1,2,3,4,5,2),true);
sort_array(array(1, 2, 3, 4, 5, 2), true)
[1,2,2,3,4,5]

spark-sql> select sort_array(array(1,2,3,4,5,2),false);
sort_array(array(1, 2, 3, 4, 5, 2), false)
[5,4,3,2,2,1]

array\_contains

Function array_containsis used to check whether value exists in an array.

Example:

spark-sql> select array_contains(array(1,2,3,4,5),1),array_contains(array(1,2,3,4,5),6);
array_contains(array(1, 2, 3, 4, 5), 1) array_contains(array(1, 2, 3, 4, 5), 6)
true    false

array\_position

This function returns the index of the 1st element of the array. The index is 1-based like other SQL languages.

Example:

spark-sql> select array_position(array(1,2,3,4,5,2),2);
array_position(array(1, 2, 3, 4, 5, 2), 2)
2

Element 2 appears twice in the array and the first index (position 2) is returned.

array\_remove

Function array_remove removes element from the array.

Example:

spark-sql> select array_remove(array(1,2,3,4,5,2),2);
array_remove(array(1, 2, 3, 4, 5, 2), 2)
[1,3,4,5]

array\_max and array\_min

These two functions return maximum and minimum values in the array respectively. NULL values will be skipped.

Example:

spark-sql> select array_max(array(1,2,3,4,5,2)),array_min(array(1,2,3,4,5,2));
array_max(array(1, 2, 3, 4, 5, 2))      array_min(array(1, 2, 3, 4, 5, 2))
5       1

array\_reverse

Function array_reverseis used to reverse an array.

Example:

spark-sql> select reverse('abcde');
reverse(abcde)
edcba

*Note: string is an array of chars.

slice

Function slice(x, start, length) extract a subset from array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.

Example:

spark-sql> select slice(array(1,2,3,4,5,2),1,3);
slice(array(1, 2, 3, 4, 5, 2), 1, 3)
[1,2,3]

spark-sql> select slice(array(1,2,3,4,5,2),-3,3);
slice(array(1, 2, 3, 4, 5, 2), -3, 3)
[4,5,2]

array\_distinct

This function removes duplicates from an array.

Example:

spark-sql> select array_distinct(array(1,2,3,4,5,2));
array_distinct(array(1, 2, 3, 4, 5, 2))
[1,2,3,4,5]

array\_union

This function unions two arrays without duplicates. The duplicates in each array itself will also be removed.

Examples:

spark-sql> select array_union(array(1,2,3,4,5,2),array(1,2,6));
array_union(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2,3,4,5,6]

array\_except

Function array_exceptreturns an array of the elements in the first array but not in the second, without duplicates.

Example:

spark-sql> select array_except(array(1,2,3,4,5,2),array(1,2,6));
array_except(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[3,4,5]

As elements 1 and 2 both appear in the second array, the result excludes them.

array\_intersect

This function returns the elements that exist in both arrays without duplicates.

Example:

spark-sql> select array_intersect(array(1,2,3,4,5,2),array(1,2,6));
array_intersect(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2]

arrays\_overlap

Function arrays_overlapreturns a boolean value or NULL. It returns true if the first array contains at least a non-null element that also is present in the second array. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.

Examples:

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(1,2,6));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
true

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(6, NULL));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(6, CAST(NULL AS INT)))
NULL

arrays\_zip

This function zips two arrays. It returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

Example:

spark-sql> select arrays_zip(array(1,2,3,4,5,2),array('a','b','c','d','e','f'));
arrays_zip(array(1, 2, 3, 4, 5, 2), array(a, b, c, d, e, f))
[{"0":1,"1":"a"},{"0":2,"1":"b"},{"0":3,"1":"c"},{"0":4,"1":"d"},{"0":5,"1":"e"},{"0":2,"1":"f"}]

array, array\_repeat and sequence

shuffle

sort\_array

array\_contains

array\_position

array\_remove

array\_max and array\_min

array\_reverse

slice

array\_distinct

array\_union

array\_except

array\_intersect

arrays\_overlap

arrays\_zip

In this article