Spark SQL - Array Functions


Unlike traditional RDBMSs, Spark SQL supports complex types such as array and map, and provides a number of built-in functions to operate efficiently on array values.

array, array_repeat and sequence

ArrayType columns can be created directly using the array or array_repeat function. The latter repeats one element multiple times based on the input parameter. As in many other data frameworks, a sequence function is also available to construct an array: it generates an array of elements from start to stop (inclusive), incrementing by step. Supported element types are byte, short, integer, long, date and timestamp. The start and stop expressions must resolve to the same type. If they resolve to the date or timestamp type, the step expression must resolve to the interval type; otherwise it must resolve to the same type as start and stop.


spark-sql> select array(1,2,3,4,5);
array(1, 2, 3, 4, 5)
[1,2,3,4,5]

spark-sql> select array_repeat(1,10);
array_repeat(1, 10)
[1,1,1,1,1,1,1,1,1,1]

spark-sql> select sequence(1, 100, 4);
sequence(1, 100, 4)
[1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97]

spark-sql> select sequence(DATE'2021-01-01', DATE'2021-12-31', interval 1 month);
sequence(DATE '2021-01-01', DATE '2021-12-31', INTERVAL '1 months')
[2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01,2021-10-01,2021-11-01,2021-12-01]
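For numeric arguments, the behavior of sequence can be sketched in plain Python. This is a hand-rolled illustration of the documented semantics (inclusive stop, default step of 1 or -1), not Spark's implementation:

```python
def sequence(start, stop, step=None):
    """Illustrative model of Spark SQL's sequence(): values from
    start to stop inclusive, incrementing by step."""
    if step is None:
        # Spark defaults the step to 1, or -1 when start > stop.
        step = 1 if stop >= start else -1
    if step == 0:
        raise ValueError("step must not be zero")
    result = []
    value = start
    # Walk toward stop; stop is included only if a step lands on it.
    while (step > 0 and value <= stop) or (step < 0 and value >= stop):
        result.append(value)
        value += step
    return result

print(sequence(1, 100, 4)[:5])  # [1, 5, 9, 13, 17]
print(sequence(5, 1))           # [5, 4, 3, 2, 1]
```

Note that sequence(1, 100, 4) ends at 97, not 100, because the next step (101) would overshoot the inclusive bound.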


shuffle

Function shuffle returns a random permutation of the input array.


spark-sql> select shuffle(array(1,2,3,4,5,2));
shuffle(array(1, 2, 3, 4, 5, 2))


sort_array

Use this function to sort an array in ascending or descending order according to the natural ordering of its elements. Null elements are placed at the beginning of the returned array in ascending order, and at the end in descending order. The second parameter is a boolean:

  • true - ascending order (default value)
  • false - descending order


spark-sql> select sort_array(array(1,2,3,4,5,2),true);
sort_array(array(1, 2, 3, 4, 5, 2), true)
[1,2,2,3,4,5]

spark-sql> select sort_array(array(1,2,3,4,5,2),false);
sort_array(array(1, 2, 3, 4, 5, 2), false)
[5,4,3,2,2,1]
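The null-placement rule can be sketched in plain Python (an illustration of the semantics, not Spark's implementation):

```python
def sort_array(arr, ascending=True):
    """Illustrative model of Spark SQL's sort_array(): nulls come
    first in ascending order and last in descending order."""
    nulls = [x for x in arr if x is None]
    rest = sorted((x for x in arr if x is not None), reverse=not ascending)
    return nulls + rest if ascending else rest + nulls

print(sort_array([3, None, 1, 2]))         # [None, 1, 2, 3]
print(sort_array([3, None, 1, 2], False))  # [3, 2, 1, None]
```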


array_contains

Function array_contains checks whether a value exists in an array.


spark-sql> select array_contains(array(1,2,3,4,5),1),array_contains(array(1,2,3,4,5),6);
array_contains(array(1, 2, 3, 4, 5), 1) array_contains(array(1, 2, 3, 4, 5), 6)
true    false


array_position

Function array_position returns the position of the first occurrence of the given element in the array, or 0 if the element is not found. Positions are 1-based, as in other SQL dialects.


spark-sql> select array_position(array(1,2,3,4,5,2),2);
array_position(array(1, 2, 3, 4, 5, 2), 2)
2

Element 2 appears twice in the array, and the position of its first occurrence (2) is returned.


array_remove

Function array_remove removes all occurrences of the given element from the array.


spark-sql> select array_remove(array(1,2,3,4,5,2),2);
array_remove(array(1, 2, 3, 4, 5, 2), 2)
[1,3,4,5]
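Note that both occurrences of 2 are removed, not just the first. A plain-Python sketch of this semantics (an illustration, not Spark's implementation):

```python
def array_remove(arr, element):
    """Illustrative model of Spark SQL's array_remove():
    drops every occurrence of element, not just the first."""
    return [x for x in arr if x != element]

print(array_remove([1, 2, 3, 4, 5, 2], 2))  # [1, 3, 4, 5]
```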

array_max and array_min

These two functions return maximum and minimum values in the array respectively. NULL values will be skipped.


spark-sql> select array_max(array(1,2,3,4,5,2)),array_min(array(1,2,3,4,5,2));
array_max(array(1, 2, 3, 4, 5, 2))      array_min(array(1, 2, 3, 4, 5, 2))
5       1


reverse

Function reverse is used to reverse an array.


spark-sql> select reverse('abcde');
reverse(abcde)
edcba

*Note: a string behaves as an array of characters.


slice

Function slice(x, start, length) extracts a subset of array x with the specified length, starting from index start (array indices are 1-based; a negative start counts from the end of the array).


spark-sql> select slice(array(1,2,3,4,5,2),1,3);
slice(array(1, 2, 3, 4, 5, 2), 1, 3)
[1,2,3]

spark-sql> select slice(array(1,2,3,4,5,2),-3,3);
slice(array(1, 2, 3, 4, 5, 2), -3, 3)
[4,5,2]
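The 1-based and negative indexing can be sketched in plain Python (an illustration of the semantics, not Spark's implementation):

```python
def slice_array(x, start, length):
    """Illustrative model of Spark SQL's slice(x, start, length).
    start is 1-based; a negative start counts back from the end."""
    if start == 0:
        raise ValueError("start must not be zero (indices are 1-based)")
    # Convert the 1-based (possibly negative) start to a 0-based offset.
    begin = start - 1 if start > 0 else len(x) + start
    return x[begin:begin + length]

print(slice_array([1, 2, 3, 4, 5, 2], 1, 3))   # [1, 2, 3]
print(slice_array([1, 2, 3, 4, 5, 2], -3, 3))  # [4, 5, 2]
```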


array_distinct

Function array_distinct removes duplicate values from an array.


spark-sql> select array_distinct(array(1,2,3,4,5,2));
array_distinct(array(1, 2, 3, 4, 5, 2))
[1,2,3,4,5]


array_union

Function array_union returns the union of two arrays without duplicates; duplicates within each input array are also removed.


spark-sql> select array_union(array(1,2,3,4,5,2),array(1,2,6));
array_union(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2,3,4,5,6]
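The dedup-while-preserving-order behavior can be sketched in plain Python (an illustration of the semantics, not Spark's implementation):

```python
def array_union(a, b):
    """Illustrative model of Spark SQL's array_union(): concatenate
    the arrays and drop duplicates, keeping first-seen order."""
    seen, out = set(), []
    for x in a + b:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

print(array_union([1, 2, 3, 4, 5, 2], [1, 2, 6]))  # [1, 2, 3, 4, 5, 6]
```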


array_except

Function array_except returns an array of the elements that are in the first array but not in the second, without duplicates.


spark-sql> select array_except(array(1,2,3,4,5,2),array(1,2,6));
array_except(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[3,4,5]

As elements 1 and 2 both appear in the second array, the result excludes them.


array_intersect

Function array_intersect returns the elements that exist in both arrays, without duplicates.


spark-sql> select array_intersect(array(1,2,3,4,5,2),array(1,2,6));
array_intersect(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2]


arrays_overlap

Function arrays_overlap returns a boolean value or NULL. It returns true if the first array contains at least one non-null element that is also present in the second array. If the arrays have no common element, both are non-empty, and either of them contains a null element, NULL is returned; otherwise the result is false.


spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(1,2,6));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
true

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(6, NULL));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(6, CAST(NULL AS INT)))
NULL
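The three-valued logic here can be sketched in plain Python, with None standing in for SQL NULL (an illustration of the semantics, not Spark's implementation):

```python
def arrays_overlap(a, b):
    """Illustrative model of Spark SQL's arrays_overlap():
    True if a non-null element is shared; None (SQL NULL) when
    there is no match but either non-empty array contains a null;
    otherwise False."""
    a_vals = {x for x in a if x is not None}
    b_vals = {x for x in b if x is not None}
    if a_vals & b_vals:
        return True
    if a and b and (None in a or None in b):
        return None
    return False

print(arrays_overlap([1, 2, 3], [1, 2, 6]))  # True
print(arrays_overlap([1, 2, 3], [6, None]))  # None
print(arrays_overlap([1, 2, 3], [6, 7]))     # False
```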


arrays_zip

Function arrays_zip merges two arrays into an array of structs, in which the N-th struct contains the N-th values of the input arrays.


spark-sql> select arrays_zip(array(1,2,3,4,5,2),array('a','b','c','d','e','f'));
arrays_zip(array(1, 2, 3, 4, 5, 2), array(a, b, c, d, e, f))
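This can be sketched in plain Python with dicts standing in for structs; the "0"/"1" field names are an assumption for illustration, matching the positional names Spark gives unnamed inputs. Unlike Python's built-in zip(), which truncates to the shorter input, Spark pads the shorter array with nulls, so zip_longest is the closer analogue:

```python
from itertools import zip_longest

def arrays_zip(a, b):
    """Illustrative model of Spark SQL's arrays_zip(): element-wise
    merge into structs, padding the shorter array with nulls."""
    return [{"0": x, "1": y} for x, y in zip_longest(a, b)]

print(arrays_zip([1, 2, 3], ["a", "b"]))
# [{'0': 1, '1': 'a'}, {'0': 2, '1': 'b'}, {'0': 3, '1': None}]
```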