Spark SQL - PIVOT Clause

event 2021-01-10 visibility 5,192 comment 0 insights
more_vert
insights Stats
Raymond Raymond Spark & PySpark

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.


Like other SQL engines, Spark also supports PIVOT clause. PIVOT is usually used to calculated aggregated values for each value in a column and the calculated values will be included as columns in the result set.

Syntax for PIVOT clause

PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ]
    FOR column_list IN ( expression_list ) )
  • aggregate_expression: an aggregate expression like MAX, MEAN, etc. 
  • aggregate_expression_alias: alias name for the aggregation expression.
  • column_list: column name list that we can to replace in the FROM clause. Note it can be multiple columns. 
  • expression_list: to specify new column names for the aggregated columns. Alias can also be used for these new columns use AS clause. 

Pivot one column

The following code snippet pivot one column named Product:

SELECT * FROM VALUES 
(101,'A',1000.01),
(101,'B',2000),
(101,'C',5000),
(102,'A',2000.01),
(102,'B',4000.1),
(103,'A',2000.01),
(103,'B',4000.1),
(101,'A',3000.01)
AS Sales(Employee,Product,Amount)
PIVOT (
SUM(Amount) AS amt, COUNT(Amount) AS cnt
FOR Product IN ( 'A' AS a, 'B' as b, 'C' AS c)
)
;

The output looks like the following:

Employee        a_amt   a_cnt   b_amt   b_cnt   c_amt   c_cnt
101     4000.02 2       2000.00 1       5000.00 1
103     2000.01 1       4000.10 1       NULL    NULL
102     2000.01 1       4000.10 1       NULL    NULL

Pivot two columns

You can pivot multiple columns at the same time in Spark SQL. For example, combinations of product and channels. 

Example:

SELECT * FROM VALUES 
(101,'A',1,1000.01),
(101,'B',0,2000),
(102,'A',0,2000.01),
(102,'B',1,4000.1),
(103,'A',0,2000.01),
(103,'B',1,4000.1),
(101,'A',1,3000.01)
AS Sales(Employee,Product,Online,Amount)
PIVOT (
SUM(Amount) AS amt
FOR (Product,Online) IN ( ('A',1) AS a_online, ('B',1) as b_online, ('A',0) AS a_other, ('B',0) as b_other)
)
;

Output:

Employee        a_online        b_online        a_other b_other
101     4000.02 NULL    NULL    2000.00
103     NULL    4000.10 2000.01 NULL
102     NULL    4000.10 2000.01 NULL
infoYou can also specify multiple aggregation functions in the above example. For simplicity, I only used SUM as the aggregation function. 
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts