Raymond Raymond / Spark & PySpark

Spark SQL - Array Functions

event 2021-01-10 visibility 5,505 comment 0 insights toc
insights Stats

Unlike traditional RDBMS systems, Spark SQL supports complex types like array or map. There are a number of built-in functions to operate efficiently on array values.

array, array_repeat and sequence

ArrayType columns can be created directly using array or array_repeat function. The latter repeat one element multiple times based on the input parameter.  Similarly as many data frameworks, sequence function is also available to construct an array, which generates an array of elements from start to stop (inclusive), incrementing by step. Supported types include byte, short, integer, long, date and timestamp. The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the 'date' or 'timestamp' type then the step expression must resolve to the 'interval' type, otherwise to the same type as the start and stop expressions.


spark-sql> select array(1,2,3,4,5);
array(1, 2, 3, 4, 5)

spark-sql> select array_repeat(1,10);
array_repeat(1, 10)

spark-sql> select sequence(1, 100, 4);
sequence(1, 100, 4)

spark-sql> select sequence(DATE'2021-01-01', DATE'2021-12-31', interval 1 month);
sequence(DATE '2021-01-01', DATE '2021-12-31', INTERVAL '1 months')


This function returns a random permutation of the input array.


spark-sql> select shuffle(array(1,2,3,4,5,2));
shuffle(array(1, 2, 3, 4, 5, 2))


Use this function to sort an array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. The second parameter is boolean:

  • true - ascending order (default value)
  • false - descending order


spark-sql> select sort_array(array(1,2,3,4,5,2),true);
sort_array(array(1, 2, 3, 4, 5, 2), true)

spark-sql> select sort_array(array(1,2,3,4,5,2),false);
sort_array(array(1, 2, 3, 4, 5, 2), false)


Function array_contains is used to check whether value exists in an array.


spark-sql> select array_contains(array(1,2,3,4,5),1),array_contains(array(1,2,3,4,5),6);
array_contains(array(1, 2, 3, 4, 5), 1) array_contains(array(1, 2, 3, 4, 5), 6)
true    false


This function returns the index of the 1st element of the array. The index is 1-based like other SQL languages. 


spark-sql> select array_position(array(1,2,3,4,5,2),2);
array_position(array(1, 2, 3, 4, 5, 2), 2)

Element 2 appears twice in the array and the first index (position 2) is returned. 


Function array_remove removes element from the array.


spark-sql> select array_remove(array(1,2,3,4,5,2),2);
array_remove(array(1, 2, 3, 4, 5, 2), 2)

array_max and array_min

These two functions return maximum and minimum values in the array respectively. NULL values will be skipped.


spark-sql> select array_max(array(1,2,3,4,5,2)),array_min(array(1,2,3,4,5,2));
array_max(array(1, 2, 3, 4, 5, 2))      array_min(array(1, 2, 3, 4, 5, 2))
5       1


Function array_reverse is used to reverse an array.


spark-sql> select reverse('abcde');

*Note: string is an array of chars. 


Function slice(x, start, length) extract a subset from array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.


spark-sql> select slice(array(1,2,3,4,5,2),1,3);
slice(array(1, 2, 3, 4, 5, 2), 1, 3)

spark-sql> select slice(array(1,2,3,4,5,2),-3,3);
slice(array(1, 2, 3, 4, 5, 2), -3, 3)


This function removes duplicates from an array.


spark-sql> select array_distinct(array(1,2,3,4,5,2));
array_distinct(array(1, 2, 3, 4, 5, 2))


This function unions two arrays without duplicates. The duplicates in each array itself will also be removed. 


spark-sql> select array_union(array(1,2,3,4,5,2),array(1,2,6));
array_union(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))


Function array_except returns an array of the elements in the first array but not in the second, without duplicates.


spark-sql> select array_except(array(1,2,3,4,5,2),array(1,2,6));
array_except(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))

As elements 1 and 2 both appear in the second array, the result excludes them.


This function returns the elements that exist in both arrays without duplicates.


spark-sql> select array_intersect(array(1,2,3,4,5,2),array(1,2,6));
array_intersect(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))


Function arrays_overlap returns a boolean value or NULL. It returns true if the first array contains at least a non-null element that also is present in the second array. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.


spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(1,2,6));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(6, NULL));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(6, CAST(NULL AS INT)))


This function zips two arrays. It returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.


spark-sql> select arrays_zip(array(1,2,3,4,5,2),array('a','b','c','d','e','f'));
arrays_zip(array(1, 2, 3, 4, 5, 2), array(a, b, c, d, e, f))
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts