spark-sql
55 items tagged with "spark-sql"
Articles
Use expr() Function in PySpark DataFrame
Spark SQL function expr() evaluates a SQL expression and returns the result as a column (pyspark.sql.column.Column). Any operator or function that can be used in Spark SQL can also be used with DataFrame operations. This code snippet provides an example of using the expr() function directly with a DataFrame. It also includes a snippet that derives the same column without using this function. *The code snippet assumes a SparkSession object already exists as 'spark'. Output:

```
+---+-----+-----+
| id|id_v1|id_v2|
+---+-----+-----+
|  1|   11|   11|
|  2|   12|   12|
|  3|   13|   13|
|  4|   14|   14|
|  5|   15|   15|
|  6|   16|   16|
|  7|   17|   17|
|  8|   18|   18|
|  9|   19|   19|
+---+-----+-----+
```
Use Spark SQL Partitioning Hints
Spark Join Strategy Hints for SQL Queries
Concatenate Columns in Spark DataFrame
This code snippet provides one example of concatenating columns using a separator in a Spark DataFrame. Function concat_ws is used directly. For the Spark SQL version, refer to Spark SQL - Concatenate w/o Separator (concat_ws and concat). Syntax of concat_ws:

```
pyspark.sql.functions.concat_ws(sep: str, *cols: ColumnOrName)
```

Output:

```
+-----+--------+--------------+
| col1|    col2|     col1_col2|
+-----+--------+--------------+
|Hello| Kontext| Hello,Kontext|
|Hello|Big Data|Hello,Big Data|
+-----+--------+--------------+
```
PySpark DataFrame - Calculate Distinct Count of Column(s)
This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function. Output:

```
+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
```
PySpark DataFrame - Calculate sum and avg with groupBy
This code snippet provides an example of calculating aggregated values after grouping data in PySpark DataFrame. To group data, DataFrame.groupby or DataFrame.groupBy can be used; then the GroupedData.agg method can be used to aggregate data for each group. Built-in aggregation functions like sum, avg, max, min and others can be used. Customized aggregation functions can also be used. Output:

```
+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+
```
PySpark DataFrame - percent_rank() Function
In Spark SQL, the PERCENT_RANK window function can be used (see Spark SQL - PERCENT_RANK Window Function). This code snippet implements percentile ranking (relative ranking) directly using the PySpark DataFrame percent_rank API instead of Spark SQL. Output:

```
+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+
```
PySpark DataFrame - rank() and dense_rank() Functions
In Spark SQL, rank and dense_rank functions can be used to rank the rows within a window partition; see Spark SQL - RANK Window Function and Spark SQL - DENSE_RANK Window Function. This code snippet implements ranking directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the TXN_DT attribute and sorts the records in each partition via the AMT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+----+----------+
|ACCT|   AMT|             TXN_DT|rank|dense_rank|
+----+------+-------------------+----+----------+
| 101|102.01|2021-01-01 00:00:00|   1|         1|
| 102|  93.0|2021-01-01 00:00:00|   2|         2|
| 101| 10.01|2021-01-01 00:00:00|   3|         3|
| 103| 913.1|2021-01-02 00:00:00|   1|         1|
| 101|900.56|2021-01-03 00:00:00|   1|         1|
| 102|900.56|2021-01-03 00:00:00|   1|         1|
| 103|  80.0|2021-01-03 00:00:00|   3|         2|
+----+------+-------------------+----+----------+
```

As printed out, the difference between dense_rank and rank is that the former will not generate any gaps if the ranked values are the same for multiple rows.
PySpark DataFrame - Add Row Number via row_number() Function
In Spark SQL, row_number can be used to generate a series of sequential numbers starting from 1 for each record in the specified window. Examples can be found on this page: Spark SQL - ROW_NUMBER Window Functions. This code snippet implements the same approach directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the ACCT attribute and sorts the records in each partition via the TXN_DT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+
|ACCT|   AMT|             TXN_DT|
+----+------+-------------------+
| 101| 10.01|2021-01-01 00:00:00|
| 101|102.01|2021-01-01 00:00:00|
| 102|  93.0|2021-01-01 00:00:00|
| 103| 913.1|2021-01-02 00:00:00|
| 101|900.56|2021-01-03 00:00:00|
+----+------+-------------------+

+----+------+-------------------+------+
|ACCT|   AMT|             TXN_DT|rownum|
+----+------+-------------------+------+
| 101|900.56|2021-01-03 00:00:00|     1|
| 101| 10.01|2021-01-01 00:00:00|     2|
| 101|102.01|2021-01-01 00:00:00|     3|
| 102|  93.0|2021-01-01 00:00:00|     1|
| 103| 913.1|2021-01-02 00:00:00|     1|
+----+------+-------------------+------+
```
Introduction to PySpark ArrayType and MapType
Introduction to PySpark StructType and StructField
Spark SQL - window Function
Spark SQL has a built-in function window to bucketize rows into one or more time windows given a timestamp column. The syntax of the function looks like the following:

```
window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None)
```

This function is available from Spark 2.0.0. slideDuration must be less than or equal to windowDuration. *These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following outputs. Query 1:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
```

The first two rows are in the same window [00:00, 00:30). Query 2:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:01:00  {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
```
Spark SQL - session_window Function
Spark SQL has a built-in function session_window to create a window column based on a timestamp column and a gap duration. The syntax of the function looks like the following:

```
session_window(timeColumn: ColumnOrName, gapDuration: Union[pyspark.sql.column.Column, str])
```

This function is available from Spark 3.2.0. *These SQL statements can also be used directly in PySpark DataFrame APIs via the spark.sql function. This code snippet prints out the following output:

```
2022-08-01 12:01:00  {"start":2022-08-01 12:01:00,"end":2022-08-01 12:31:00}
2022-08-01 12:15:00  {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:31:01  {"start":2022-08-01 12:31:01,"end":2022-08-01 13:01:01}
```
PySpark - Flatten (Explode) Nested StructType Column
PySpark - Read and Parse Apache Access Log Text Files
Spark SQL - Get Next Monday, Tuesday, Wednesday, Thursday, etc.
Spark SQL - Make Date, Timestamp and Intervals
Spark SQL - Get Current Timezone
Spark SQL - Date and Timestamp Truncate Functions
Spark SQL - Extract Day, Month, Year and other Part from Date or Timestamp
Spark SQL - Add Day, Month and Year to Date
Spark SQL - Convert String to Int (int and cast function)
Spark SQL - Literals (Constants)
Spark SQL Joins with Examples
Spark SQL - PERCENT_RANK Window Function
Spark SQL - Date Difference in Seconds, Minutes, Hours
Spark SQL - Average (AVG) Calculation
Spark SQL - Group By
Spark SQL - PIVOT Clause
Spark SQL - Calculate Covariance
Spark SQL - Standard Deviation Calculation
Spark SQL - FIRST_VALUE or LAST_VALUE
Spark SQL - Array Functions
Spark SQL - Map Functions
Spark SQL - Convert Object to JSON String
Spark SQL - Extract Value from JSON String
Spark SQL - Convert JSON String to Map
Spark SQL - Convert String to Timestamp
Spark SQL - UNIX timestamp functions
Spark SQL - Date and Timestamp Function
Spark SQL - LEAD Window Function
Spark SQL - LAG Window Function
Spark SQL - NTILE Window Function
Spark SQL - DENSE_RANK Window Function
Spark SQL - RANK Window Function
Spark SQL - ROW_NUMBER Window Functions
Show Headings (Column Names) in spark-sql CLI Result
Spark SQL - Construct Virtual Table in Memory using Values
Diagrams
Spark SQL Joins - Cross Join (Cartesian Product)
This diagram shows Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Anti Join
This diagram shows the Left Anti Join type in Spark SQL. An anti join returns values from the left relation that have no match with the right. It is also called a left anti join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Semi Join
This diagram shows the Left Semi Join type in Spark SQL. A semi join returns values from the left side of the relation that have a match with the right. It is also called a left semi join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Full Outer Join
This diagram shows Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called full outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Right Outer Join
This diagram shows Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called right outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Outer Join
This diagram shows Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called left outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Inner Join
This diagram shows Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)