Spark SQL - Array Functions
Unlike traditional RDBMS systems, Spark SQL supports complex types like array or map. There are a number of built-in functions to operate efficiently on array values.
array, array_repeat and sequence
ArrayType columns can be created directly using array or array_repeat function. The latter repeat one element multiple times based on the input parameter. Similarly as many data frameworks, sequence function is also available to construct an array, which generates an array of elements from start to stop (inclusive), incrementing by step. Supported types include byte, short, integer, long, date and timestamp. The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the 'date' or 'timestamp' type then the step expression must resolve to the 'interval' type, otherwise to the same type as the start and stop expressions.
Example:
spark-sql> select array(1,2,3,4,5); array(1, 2, 3, 4, 5) [1,2,3,4,5] spark-sql> select array_repeat(1,10); array_repeat(1, 10) [1,1,1,1,1,1,1,1,1,1] spark-sql> select sequence(1, 100, 4); sequence(1, 100, 4) [1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97] spark-sql> select sequence(DATE'2021-01-01', DATE'2021-12-31', interval 1 month); sequence(DATE '2021-01-01', DATE '2021-12-31', INTERVAL '1 months') [2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01,2021-10-01,2021-11-01,2021-12-01]
shuffle
This function returns a random permutation of the input array.
Example:
spark-sql> select shuffle(array(1,2,3,4,5,2)); shuffle(array(1, 2, 3, 4, 5, 2)) [3,5,2,1,4,2]
sort_array
Use this function to sort an array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. The second parameter is boolean:
- true - ascending order (default value)
- false - descending order
Example:
spark-sql> select sort_array(array(1,2,3,4,5,2),true); sort_array(array(1, 2, 3, 4, 5, 2), true) [1,2,2,3,4,5] spark-sql> select sort_array(array(1,2,3,4,5,2),false); sort_array(array(1, 2, 3, 4, 5, 2), false) [5,4,3,2,2,1]
array_contains
Function array_contains is used to check whether value exists in an array.
Example:
spark-sql> select array_contains(array(1,2,3,4,5),1),array_contains(array(1,2,3,4,5),6); array_contains(array(1, 2, 3, 4, 5), 1) array_contains(array(1, 2, 3, 4, 5), 6) true false
array_position
This function returns the index of the 1st element of the array. The index is 1-based like other SQL languages.
Example:
spark-sql> select array_position(array(1,2,3,4,5,2),2); array_position(array(1, 2, 3, 4, 5, 2), 2) 2
Element 2 appears twice in the array and the first index (position 2) is returned.
array_remove
Function array_remove removes element from the array.
Example:
spark-sql> select array_remove(array(1,2,3,4,5,2),2); array_remove(array(1, 2, 3, 4, 5, 2), 2) [1,3,4,5]
array_max and array_min
These two functions return maximum and minimum values in the array respectively. NULL values will be skipped.
Example:
spark-sql> select array_max(array(1,2,3,4,5,2)),array_min(array(1,2,3,4,5,2)); array_max(array(1, 2, 3, 4, 5, 2)) array_min(array(1, 2, 3, 4, 5, 2)) 5 1
array_reverse
Function array_reverse is used to reverse an array.
Example:
spark-sql> select reverse('abcde'); reverse(abcde) edcba
*Note: string is an array of chars.
slice
Function slice(x, start, length) extract a subset from array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.
Example:
spark-sql> select slice(array(1,2,3,4,5,2),1,3); slice(array(1, 2, 3, 4, 5, 2), 1, 3) [1,2,3] spark-sql> select slice(array(1,2,3,4,5,2),-3,3); slice(array(1, 2, 3, 4, 5, 2), -3, 3) [4,5,2]
array_distinct
This function removes duplicates from an array.
Example:
spark-sql> select array_distinct(array(1,2,3,4,5,2)); array_distinct(array(1, 2, 3, 4, 5, 2)) [1,2,3,4,5]
array_union
This function unions two arrays without duplicates. The duplicates in each array itself will also be removed.
Examples:
spark-sql> select array_union(array(1,2,3,4,5,2),array(1,2,6)); array_union(array(1, 2, 3, 4, 5, 2), array(1, 2, 6)) [1,2,3,4,5,6]
array_except
Function array_except returns an array of the elements in the first array but not in the second, without duplicates.
Example:
spark-sql> select array_except(array(1,2,3,4,5,2),array(1,2,6)); array_except(array(1, 2, 3, 4, 5, 2), array(1, 2, 6)) [3,4,5]
As elements 1 and 2 both appear in the second array, the result excludes them.
array_intersect
This function returns the elements that exist in both arrays without duplicates.
Example:
spark-sql> select array_intersect(array(1,2,3,4,5,2),array(1,2,6)); array_intersect(array(1, 2, 3, 4, 5, 2), array(1, 2, 6)) [1,2]
arrays_overlap
Function arrays_overlap returns a boolean value or NULL. It returns true if the first array contains at least a non-null element that also is present in the second array. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.
Examples:
spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(1,2,6)); arrays_overlap(array(1, 2, 3, 4, 5, 2), array(1, 2, 6)) true spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(6, NULL)); arrays_overlap(array(1, 2, 3, 4, 5, 2), array(6, CAST(NULL AS INT))) NULL
arrays_zip
This function zips two arrays. It returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Example:
spark-sql> select arrays_zip(array(1,2,3,4,5,2),array('a','b','c','d','e','f')); arrays_zip(array(1, 2, 3, 4, 5, 2), array(a, b, c, d, e, f)) [{"0":1,"1":"a"},{"0":2,"1":"b"},{"0":3,"1":"c"},{"0":4,"1":"d"},{"0":5,"1":"e"},{"0":2,"1":"f"}]