Spark SQL - PIVOT Clause


Like other SQL engines, Spark also supports the PIVOT clause. PIVOT is usually used to calculate aggregated values for each value in a column, and the calculated values are included as columns in the result set.

Syntax for PIVOT clause

PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ]
    FOR column_list IN ( expression_list ) )
  • aggregate_expression: an aggregate expression such as SUM, COUNT, MAX, etc. 
  • aggregate_expression_alias: an alias for the aggregate expression; it is used to build the names of the output columns.
  • column_list: the column (or columns) from the FROM clause whose values are pivoted into new columns. Note that it can contain multiple columns. 
  • expression_list: the values of the pivot column(s) to aggregate on, which become the new columns in the result; each value can be given an alias using the AS clause. 
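In a full query, the PIVOT clause is placed right after the FROM clause (a table, view or subquery). A minimal sketch, assuming a hypothetical table named sales_table with columns Product and Amount, looks like this:

SELECT *
FROM sales_table
PIVOT (
    SUM(Amount) AS amt
    FOR Product IN ('A', 'B', 'C')
);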

Pivot one column

The following code snippet pivots on a single column named Product:

SELECT * FROM VALUES 
(101,'A',1000.01),
(101,'B',2000),
(101,'C',5000),
(102,'A',2000.01),
(102,'B',4000.1),
(103,'A',2000.01),
(103,'B',4000.1),
(101,'A',3000.01)
AS Sales(Employee,Product,Amount)
PIVOT (
    SUM(Amount) AS amt, COUNT(Amount) AS cnt
    FOR Product IN ('A' AS a, 'B' AS b, 'C' AS c)
);

The output looks like the following:

Employee  a_amt    a_cnt  b_amt    b_cnt  c_amt    c_cnt
101       4000.02  2      2000.00  1      5000.00  1
103       2000.01  1      4000.10  1      NULL     NULL
102       2000.01  1      4000.10  1      NULL     NULL
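
The column that is neither pivoted nor aggregated (Employee here) implicitly becomes the grouping column. Conceptually, the pivot above is equivalent to a conditional aggregation grouped by Employee; a rough sketch of such an equivalent query, registering the same sample rows as a temporary view, looks like this:

CREATE OR REPLACE TEMPORARY VIEW Sales AS
SELECT * FROM VALUES 
(101,'A',1000.01),
(101,'B',2000),
(101,'C',5000),
(102,'A',2000.01),
(102,'B',4000.1),
(103,'A',2000.01),
(103,'B',4000.1),
(101,'A',3000.01)
AS Sales(Employee,Product,Amount);

SELECT Employee,
    SUM(CASE WHEN Product = 'A' THEN Amount END)   AS a_amt,
    COUNT(CASE WHEN Product = 'A' THEN Amount END) AS a_cnt,
    SUM(CASE WHEN Product = 'B' THEN Amount END)   AS b_amt,
    COUNT(CASE WHEN Product = 'B' THEN Amount END) AS b_cnt,
    SUM(CASE WHEN Product = 'C' THEN Amount END)   AS c_amt,
    COUNT(CASE WHEN Product = 'C' THEN Amount END) AS c_cnt
FROM Sales
GROUP BY Employee;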

Pivot two columns

You can pivot on multiple columns at the same time in Spark SQL, for example, on combinations of product and sales channel. 

Example:

SELECT * FROM VALUES 
(101,'A',1,1000.01),
(101,'B',0,2000),
(102,'A',0,2000.01),
(102,'B',1,4000.1),
(103,'A',0,2000.01),
(103,'B',1,4000.1),
(101,'A',1,3000.01)
AS Sales(Employee,Product,Online,Amount)
PIVOT (
    SUM(Amount) AS amt
    FOR (Product, Online) IN (
        ('A',1) AS a_online, ('B',1) AS b_online,
        ('A',0) AS a_other,  ('B',0) AS b_other
    )
);

Output:

Employee  a_online  b_online  a_other  b_other
101       4000.02   NULL      NULL     2000.00
103       NULL      4000.10   2000.01  NULL
102       NULL      4000.10   2000.01  NULL
You can also specify multiple aggregation functions in the above example; for simplicity, I only used SUM as the aggregation function. A sketch with an additional COUNT follows. 
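
For example, adding COUNT alongside SUM in the two-column pivot could look like the sketch below; each output column name is then built from the value alias and the aggregation alias (for example a_online_amt and a_online_cnt, mirroring the single-column example):

SELECT * FROM VALUES 
(101,'A',1,1000.01),
(101,'B',0,2000),
(102,'A',0,2000.01),
(102,'B',1,4000.1),
(103,'A',0,2000.01),
(103,'B',1,4000.1),
(101,'A',1,3000.01)
AS Sales(Employee,Product,Online,Amount)
PIVOT (
    SUM(Amount) AS amt, COUNT(Amount) AS cnt
    FOR (Product, Online) IN (
        ('A',1) AS a_online, ('B',1) AS b_online,
        ('A',0) AS a_other,  ('B',0) AS b_other
    )
);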