Spark SQL - Array Functions

event 2021-01-10 visibility 5,729 comment 0 insights
more_vert
insights Stats
Raymond Raymond Spark & PySpark

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.


Unlike traditional RDBMS systems, Spark SQL supports complex types like array or map. There are a number of built-in functions to operate efficiently on array values.

array, array_repeat and sequence

ArrayType columns can be created directly using array or array_repeat function. The latter repeat one element multiple times based on the input parameter.  Similarly as many data frameworks, sequence function is also available to construct an array, which generates an array of elements from start to stop (inclusive), incrementing by step. Supported types include byte, short, integer, long, date and timestamp. The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the 'date' or 'timestamp' type then the step expression must resolve to the 'interval' type, otherwise to the same type as the start and stop expressions.

Example:

spark-sql> select array(1,2,3,4,5);
array(1, 2, 3, 4, 5)
[1,2,3,4,5]

spark-sql> select array_repeat(1,10);
array_repeat(1, 10)
[1,1,1,1,1,1,1,1,1,1]

spark-sql> select sequence(1, 100, 4);
sequence(1, 100, 4)
[1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97]

spark-sql> select sequence(DATE'2021-01-01', DATE'2021-12-31', interval 1 month);
sequence(DATE '2021-01-01', DATE '2021-12-31', INTERVAL '1 months')
[2021-01-01,2021-02-01,2021-03-01,2021-04-01,2021-05-01,2021-06-01,2021-07-01,2021-08-01,2021-09-01,2021-10-01,2021-11-01,2021-12-01]

shuffle

This function returns a random permutation of the input array.

Example:

spark-sql> select shuffle(array(1,2,3,4,5,2));
shuffle(array(1, 2, 3, 4, 5, 2))
[3,5,2,1,4,2]

sort_array

Use this function to sort an array in ascending or descending order according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. The second parameter is boolean:

  • true - ascending order (default value)
  • false - descending order

Example:

spark-sql> select sort_array(array(1,2,3,4,5,2),true);
sort_array(array(1, 2, 3, 4, 5, 2), true)
[1,2,2,3,4,5]

spark-sql> select sort_array(array(1,2,3,4,5,2),false);
sort_array(array(1, 2, 3, 4, 5, 2), false)
[5,4,3,2,2,1]

array_contains

Function array_contains is used to check whether value exists in an array.

Example:

spark-sql> select array_contains(array(1,2,3,4,5),1),array_contains(array(1,2,3,4,5),6);
array_contains(array(1, 2, 3, 4, 5), 1) array_contains(array(1, 2, 3, 4, 5), 6)
true    false

array_position

This function returns the index of the 1st element of the array. The index is 1-based like other SQL languages. 

Example:

spark-sql> select array_position(array(1,2,3,4,5,2),2);
array_position(array(1, 2, 3, 4, 5, 2), 2)
2

Element 2 appears twice in the array and the first index (position 2) is returned. 

array_remove

Function array_remove removes element from the array.

Example:

spark-sql> select array_remove(array(1,2,3,4,5,2),2);
array_remove(array(1, 2, 3, 4, 5, 2), 2)
[1,3,4,5]

array_max and array_min

These two functions return maximum and minimum values in the array respectively. NULL values will be skipped.

Example:

spark-sql> select array_max(array(1,2,3,4,5,2)),array_min(array(1,2,3,4,5,2));
array_max(array(1, 2, 3, 4, 5, 2))      array_min(array(1, 2, 3, 4, 5, 2))
5       1

array_reverse

Function array_reverse is used to reverse an array.

Example:

spark-sql> select reverse('abcde');
reverse(abcde)
edcba

*Note: string is an array of chars. 

slice

Function slice(x, start, length) extract a subset from array x starting from index start (array indices start at 1, or starting from the end if start is negative) with the specified length.

Example:

spark-sql> select slice(array(1,2,3,4,5,2),1,3);
slice(array(1, 2, 3, 4, 5, 2), 1, 3)
[1,2,3]

spark-sql> select slice(array(1,2,3,4,5,2),-3,3);
slice(array(1, 2, 3, 4, 5, 2), -3, 3)
[4,5,2]

array_distinct

This function removes duplicates from an array.

Example:

spark-sql> select array_distinct(array(1,2,3,4,5,2));
array_distinct(array(1, 2, 3, 4, 5, 2))
[1,2,3,4,5]

array_union

This function unions two arrays without duplicates. The duplicates in each array itself will also be removed. 

Examples:

spark-sql> select array_union(array(1,2,3,4,5,2),array(1,2,6));
array_union(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2,3,4,5,6]

array_except

Function array_except returns an array of the elements in the first array but not in the second, without duplicates.

Example:

spark-sql> select array_except(array(1,2,3,4,5,2),array(1,2,6));
array_except(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[3,4,5]

As elements 1 and 2 both appear in the second array, the result excludes them.

array_intersect

This function returns the elements that exist in both arrays without duplicates.

Example:

spark-sql> select array_intersect(array(1,2,3,4,5,2),array(1,2,6));
array_intersect(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
[1,2]

arrays_overlap

Function arrays_overlap returns a boolean value or NULL. It returns true if the first array contains at least a non-null element that also is present in the second array. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.

Examples:

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(1,2,6));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(1, 2, 6))
true

spark-sql> select arrays_overlap(array(1,2,3,4,5,2),array(6, NULL));
arrays_overlap(array(1, 2, 3, 4, 5, 2), array(6, CAST(NULL AS INT)))
NULL

arrays_zip

This function zips two arrays. It returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

Example:

spark-sql> select arrays_zip(array(1,2,3,4,5,2),array('a','b','c','d','e','f'));
arrays_zip(array(1, 2, 3, 4, 5, 2), array(a, b, c, d, e, f))
[{"0":1,"1":"a"},{"0":2,"1":"b"},{"0":3,"1":"c"},{"0":4,"1":"d"},{"0":5,"1":"e"},{"0":2,"1":"f"}]
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts