spark-sql
55 items tagged with "spark-sql"
Articles
Use expr() Function in PySpark DataFrame
Spark SQL function expr() evaluates a SQL expression and returns the result as a column (pyspark.sql.column.Column). Any operator or function that can be used in Spark SQL can also be used with DataFrame operations. This code snippet provides an example of using the expr() function directly with a DataFrame. It also includes a snippet that derives the same column without using this function. *The code snippet assumes a SparkSession object already exists as 'spark'. Output:

```
+---+-----+-----+
| id|id_v1|id_v2|
+---+-----+-----+
|  1|   11|   11|
|  2|   12|   12|
|  3|   13|   13|
|  4|   14|   14|
|  5|   15|   15|
|  6|   16|   16|
|  7|   17|   17|
|  8|   18|   18|
|  9|   19|   19|
+---+-----+-----+
```
Use Spark SQL Partitioning Hints
Spark Join Strategy Hints for SQL Queries
Concatenate Columns in Spark DataFrame
This code snippet provides one example of concatenating columns using a separator in a Spark DataFrame. Function concat_ws is used directly. For the Spark SQL version, refer to Spark SQL - Concatenate w/o Separator (concat_ws and concat). Syntax of concat_ws:

```
pyspark.sql.functions.concat_ws(sep: str, *cols: ColumnOrName)
```

Output:

```
+-----+--------+--------------+
| col1|    col2|     col1_col2|
+-----+--------+--------------+
|Hello| Kontext| Hello,Kontext|
|Hello|Big Data|Hello,Big Data|
+-----+--------+--------------+
```
PySpark DataFrame - Calculate Distinct Count of Column(s)
This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function. Output:

```
+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
```
PySpark DataFrame - Calculate sum and avg with groupBy
This code snippet provides an example of calculating aggregated values after grouping data in PySpark DataFrame. To group data, DataFrame.groupby or DataFrame.groupBy can be used; then the GroupedData.agg method can be used to aggregate data for each group. Built-in aggregation functions like sum, avg, max, min and others can be used. Customized aggregation functions can also be used. Output:

```
+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+
```
PySpark DataFrame - percent_rank() Function
In Spark SQL, the PERCENT_RANK window function can be used (see Spark SQL - PERCENT_RANK Window Function). This code snippet implements percentile ranking (relative ranking) directly using the PySpark DataFrame percent_rank API instead of Spark SQL. Output:

```
+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+
```
PySpark DataFrame - rank() and dense_rank() Functions
In Spark SQL, rank and dense_rank functions can be used to rank the rows within a window partition; see Spark SQL - RANK Window Function and Spark SQL - DENSE_RANK Window Function. This code snippet implements ranking directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the TXN_DT attribute and sorts the records in each partition via the AMT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+----+----------+
|ACCT|   AMT|             TXN_DT|rank|dense_rank|
+----+------+-------------------+----+----------+
| 101|102.01|2021-01-01 00:00:00|   1|         1|
| 102|  93.0|2021-01-01 00:00:00|   2|         2|
| 101| 10.01|2021-01-01 00:00:00|   3|         3|
| 103| 913.1|2021-01-02 00:00:00|   1|         1|
| 101|900.56|2021-01-03 00:00:00|   1|         1|
| 102|900.56|2021-01-03 00:00:00|   1|         1|
| 103|  80.0|2021-01-03 00:00:00|   3|         2|
+----+------+-------------------+----+----------+
```

As printed out, the difference between dense_rank and rank is that the former will not generate any gaps if the ranked values are the same for multiple rows.
PySpark DataFrame - Add Row Number via row_number() Function
In Spark SQL, row_number can be used to generate a series of sequential numbers starting from 1 for each record in the specified window. Examples can be found on this page: Spark SQL - ROW_NUMBER Window Functions. This code snippet implements the same approach directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the ACCT attribute and sorts the records in each partition via the TXN_DT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+
|ACCT|   AMT|             TXN_DT|
+----+------+-------------------+
| 101| 10.01|2021-01-01 00:00:00|
| 101|102.01|2021-01-01 00:00:00|
| 102|  93.0|2021-01-01 00:00:00|
| 103| 913.1|2021-01-02 00:00:00|
| 101|900.56|2021-01-03 00:00:00|
+----+------+-------------------+

+----+------+-------------------+------+
|ACCT|   AMT|             TXN_DT|rownum|
+----+------+-------------------+------+
| 101|900.56|2021-01-03 00:00:00|     1|
| 101| 10.01|2021-01-01 00:00:00|     2|
| 101|102.01|2021-01-01 00:00:00|     3|
| 102|  93.0|2021-01-01 00:00:00|     1|
| 103| 913.1|2021-01-02 00:00:00|     1|
+----+------+-------------------+------+
```
Introduction to PySpark ArrayType and MapType
Introduction to PySpark StructType and StructField
Spark SQL - window Function
Spark SQL has a built-in function window to bucketize rows into one or more time windows given a timestamp column. The syntax of the function looks like the following:

```
window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None)
```

This function is available from Spark 2.0.0. slideDuration must be less than or equal to windowDuration. *These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following outputs. Query 1:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
```

The first two rows are in the same window [00:00, 00:30). Query 2:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:01:00  {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
```
Spark SQL - session_window Function
Spark SQL has a built-in function session_window to create a window column based on a timestamp column and a gap duration. The syntax of the function looks like the following:

```
session_window(timeColumn: ColumnOrName, gapDuration: Union[pyspark.sql.column.Column, str])
```

This function is available from Spark 3.2.0. *These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following output:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:01:00,"end":2022-08-01 12:31:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:31:01,"end":2022-08-01 13:01:01}
```
PySpark - Flatten (Explode) Nested StructType Column
PySpark - Read and Parse Apache Access Log Text Files
Spark SQL - Get Next Monday, Tuesday, Wednesday, Thursday, etc.
Spark SQL - Make Date, Timestamp and Intervals
Spark SQL - Get Current Timezone
Spark SQL - Date and Timestamp Truncate Functions
Spark SQL - Extract Day, Month, Year and other Part from Date or Timestamp
Spark SQL - Add Day, Month and Year to Date
Spark SQL - Convert String to Int (int and cast function)
Spark SQL - Literals (Constants)
Spark SQL Joins with Examples
Spark SQL - PERCENT_RANK Window Function
Spark SQL - Date Difference in Seconds, Minutes, Hours
Spark SQL - Average (AVG) Calculation
Spark SQL - Group By
Spark SQL - PIVOT Clause
Spark SQL - Calculate Covariance
Spark SQL - Standard Deviation Calculation
Spark SQL - FIRST_VALUE or LAST_VALUE
Spark SQL - Array Functions
Spark SQL - Map Functions
Spark SQL - Convert Object to JSON String
Spark SQL - Extract Value from JSON String
Spark SQL - Convert JSON String to Map
Spark SQL - Convert String to Timestamp
Spark SQL - UNIX timestamp functions
Spark SQL - Date and Timestamp Function
Spark SQL - LEAD Window Function
Spark SQL - LAG Window Function
Spark SQL - NTILE Window Function
Spark SQL - DENSE_RANK Window Function
Spark SQL - RANK Window Function
Spark SQL - ROW_NUMBER Window Functions
Show Headings (Column Names) in spark-sql CLI Result
Spark SQL - Construct Virtual Table in Memory using Values
Diagrams
Spark SQL Joins - Cross Join (Cartesian Product)
This diagram shows Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Anti Join
This diagram shows the Left Anti Join type in Spark SQL. An anti join returns values from the left relation that have no match with the right. It is also called a left anti join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Semi Join
This diagram shows the Left Semi Join type in Spark SQL. A semi join returns values from the left side of the relation that have a match with the right. It is also called a left semi join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Full Outer Join
This diagram shows Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called full outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Right Outer Join
This diagram shows Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called right outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Outer Join
This diagram shows Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called left outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Inner Join
This diagram shows Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)