spark-sql

55 items tagged with "spark-sql"

48 Articles
7 Diagrams

Articles

Use expr() Function in PySpark DataFrame

Spark SQL function expr() can be used to evaluate a SQL expression and return it as a column (pyspark.sql.column.Column). Any operator or function that can be used in Spark SQL can also be used with DataFrame operations. This code snippet provides an example of using the expr() function directly with a DataFrame. It also includes a snippet that derives the same column without using this function. Note: the code snippet assumes a SparkSession object already exists as 'spark'. Output:

+---+----+----+
| id|idv1|idv2|
+---+----+----+
|  1|  11|  11|
|  2|  12|  12|
|  3|  13|  13|
|  4|  14|  14|
|  5|  15|  15|
|  6|  16|  16|
|  7|  17|  17|
|  8|  18|  18|
|  9|  19|  19|
+---+----+----+

2022-09-02
Code Snippets & Tips

Use Spark SQL Partitioning Hints

2022-08-21
Spark & PySpark

Spark Join Strategy Hints for SQL Queries

2022-08-21
Spark & PySpark

Concatenate Columns in Spark DataFrame

This code snippet provides one example of concatenating columns with a separator in a Spark DataFrame. The function concat_ws is used directly. For the Spark SQL version, refer to Spark SQL - Concatenate w/o Separator (concat_ws and concat). Syntax of concat_ws:

pyspark.sql.functions.concat_ws(sep: str, *cols: ColumnOrName)

Output:

+-----+--------+--------------+
| col1|    col2|     col1_col2|
+-----+--------+--------------+
|Hello| Kontext| Hello,Kontext|
|Hello|Big Data|Hello,Big Data|
+-----+--------+--------------+

2022-08-19
Code Snippets & Tips

PySpark DataFrame - Calculate Distinct Count of Column(s)

This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function. Output:

+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+

2022-08-19
Code Snippets & Tips

PySpark DataFrame - Calculate sum and avg with groupBy

This code snippet provides an example of calculating aggregated values after grouping data in a PySpark DataFrame. To group data, DataFrame.groupby or DataFrame.groupBy can be used; the GroupedData.agg method then aggregates the data in each group. Built-in aggregation functions like sum, avg, max and min can be used, as can customized aggregation functions. Output:

+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+

2022-08-19
Code Snippets & Tips

PySpark DataFrame - percent_rank() Function

In Spark SQL, the PERCENT_RANK function calculates percentile ranking (see Spark SQL - PERCENT_RANK Window Function). This code snippet implements percentile ranking (relative ranking) directly using the PySpark DataFrame percent_rank API instead of Spark SQL. Output:

+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+

2022-08-18
Code Snippets & Tips

PySpark DataFrame - rank() and dense_rank() Functions

In Spark SQL, the rank and dense_rank functions can be used to rank rows within a window partition: RANK (Spark SQL - RANK Window Function) and DENSE_RANK (Spark SQL - DENSE_RANK Window Function). This code snippet implements ranking directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the TXN_DT attribute and sorts the records in each partition by the AMT column in descending order. The frame boundary of the window is defined as unbounded preceding to current row. Output:

+----+------+-------------------+----+----------+
|ACCT|   AMT|             TXN_DT|rank|dense_rank|
+----+------+-------------------+----+----------+
| 101|102.01|2021-01-01 00:00:00|   1|         1|
| 102|  93.0|2021-01-01 00:00:00|   2|         2|
| 101| 10.01|2021-01-01 00:00:00|   3|         3|
| 103| 913.1|2021-01-02 00:00:00|   1|         1|
| 101|900.56|2021-01-03 00:00:00|   1|         1|
| 102|900.56|2021-01-03 00:00:00|   1|         1|
| 103|  80.0|2021-01-03 00:00:00|   3|         2|
+----+------+-------------------+----+----------+

As printed out, the difference between dense_rank and rank is that the former does not generate gaps when multiple rows share the same ranked value.

2022-08-18
Code Snippets & Tips

PySpark DataFrame - Add Row Number via row_number() Function

In Spark SQL, row_number can be used to generate a series of sequential numbers starting from 1 for each record in the specified window. Examples can be found on this page: Spark SQL - ROW_NUMBER Window Functions. This code snippet implements the same approach directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the ACCT attribute and sorts the records in each partition by the TXN_DT column in descending order. The frame boundary of the window is defined as unbounded preceding to current row. Output:

+----+------+-------------------+
|ACCT|   AMT|             TXN_DT|
+----+------+-------------------+
| 101| 10.01|2021-01-01 00:00:00|
| 101|102.01|2021-01-01 00:00:00|
| 102|  93.0|2021-01-01 00:00:00|
| 103| 913.1|2021-01-02 00:00:00|
| 101|900.56|2021-01-03 00:00:00|
+----+------+-------------------+

+----+------+-------------------+------+
|ACCT|   AMT|             TXN_DT|rownum|
+----+------+-------------------+------+
| 101|900.56|2021-01-03 00:00:00|     1|
| 101| 10.01|2021-01-01 00:00:00|     2|
| 101|102.01|2021-01-01 00:00:00|     3|
| 102|  93.0|2021-01-01 00:00:00|     1|
| 103| 913.1|2021-01-02 00:00:00|     1|
+----+------+-------------------+------+

2022-08-18
Code Snippets & Tips

Introduction to PySpark ArrayType and MapType

2022-08-18
Spark & PySpark

Introduction to PySpark StructType and StructField

2022-08-17
Spark & PySpark

Spark SQL - window Function

Spark SQL has a built-in function window to bucketize rows into one or more time windows given a timestamp column. The syntax of the function looks like the following:

window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None)

This function is available from Spark 2.0.0. slideDuration must be less than or equal to windowDuration. These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following outputs. Query 1:

2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}

The first two rows are in the same window [12:00, 12:30). Query 2:

2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:01:00  {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}

2022-08-16
Code Snippets & Tips

Spark SQL - session_window Function

Spark SQL has a built-in function session_window to create a window column based on a timestamp column and a gap duration. The syntax of the function looks like the following:

session_window(timeColumn: ColumnOrName, gapDuration: Union[pyspark.sql.column.Column, str])

This function is available from Spark 3.2.0. These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following output:

2022-08-01 12:01:00  {"start":2022-08-01 12:01:00,"end":2022-08-01 12:31:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:31:01,"end":2022-08-01 13:01:01}

2022-08-16
Code Snippets & Tips

PySpark - Flatten (Explode) Nested StructType Column

2022-07-09
Spark & PySpark

PySpark - Read and Parse Apache Access Log Text Files

2022-07-09
Spark & PySpark

Spark SQL - Get Next Monday, Tuesday, Wednesday, Thursday, etc.

2022-06-16
Code Snippets & Tips

Spark SQL - Make Date, Timestamp and Intervals

2022-06-16
Code Snippets & Tips

Spark SQL - Get Current Timezone

2022-06-16
Code Snippets & Tips

Spark SQL - Date and Timestamp Truncate Functions

2022-06-15
Code Snippets & Tips

Spark SQL - Extract Day, Month, Year and other Part from Date or Timestamp

2022-06-15
Code Snippets & Tips

Spark SQL - Add Day, Month and Year to Date

2022-06-14
Code Snippets & Tips

Spark SQL - Convert String to Int (int and cast function)

2022-06-04
Code Snippets & Tips

Spark SQL - Literals (Constants)

2022-05-31
Spark & PySpark

Spark SQL Joins with Examples

2022-05-31
Spark & PySpark

Spark SQL - PERCENT_RANK Window Function

2021-10-18
Spark & PySpark

Spark SQL - Date Difference in Seconds, Minutes, Hours

2021-10-12
Spark & PySpark

Spark SQL - Average (AVG) Calculation

2021-09-25
Code Snippets & Tips

Spark SQL - Group By

2021-09-24
Code Snippets & Tips

Spark SQL - PIVOT Clause

2021-01-10
Spark & PySpark

Spark SQL - Calculate Covariance

2021-01-10
Code Snippets & Tips

Spark SQL - Standard Deviation Calculation

2021-01-10
Code Snippets & Tips

Spark SQL - FIRST_VALUE or LAST_VALUE

2021-01-10
Code Snippets & Tips

Spark SQL - Array Functions

2021-01-10
Spark & PySpark

Spark SQL - Map Functions

2021-01-09
Spark & PySpark

Spark SQL - Convert Object to JSON String

2021-01-09
Code Snippets & Tips

Spark SQL - Extract Value from JSON String

2021-01-09
Code Snippets & Tips

Spark SQL - Convert JSON String to Map

2021-01-09
Spark & PySpark

Spark SQL - Convert String to Timestamp

2021-01-09
Spark & PySpark

Spark SQL - UNIX timestamp functions

2021-01-09
Spark & PySpark

Spark SQL - Date and Timestamp Function

2021-01-09
Spark & PySpark

Spark SQL - LEAD Window Function

2021-01-06
Spark & PySpark

Spark SQL - LAG Window Function

2021-01-06
Spark & PySpark

Spark SQL - NTILE Window Function

2021-01-06
Spark & PySpark

Spark SQL - DENSE_RANK Window Function

2021-01-06
Spark & PySpark

Spark SQL - RANK Window Function

2021-01-03
Spark & PySpark

Spark SQL - ROW_NUMBER Window Functions

2020-12-31
Spark & PySpark

Show Headings (Column Names) in spark-sql CLI Result

2020-12-28
Spark & PySpark

Spark SQL - Construct Virtual Table in Memory using Values

2020-12-27
Code Snippets & Tips

Diagrams

Spark SQL Joins - Cross Join (Cartesian Product)

This diagram shows the Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Anti Join

This diagram shows the Left Anti Join type in Spark SQL. An anti join returns the rows from the left relation that have no match in the right relation. It is also called a left anti join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Semi Join

This diagram shows the Left Semi Join type in Spark SQL. A semi join returns the rows from the left relation that have a match in the right relation. It is also called a left semi join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Full Outer Join

This diagram shows the Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called a full outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Right Outer Join

This diagram shows the Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called a right outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Outer Join

This diagram shows the Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called a left outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Inner Join

This diagram shows the Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project