spark-sql-function

46 items tagged with "spark-sql-function"

Articles

PySpark DataFrame - Convert JSON Column to Row using json_tuple

The PySpark SQL function `json_tuple` can be used to convert JSON string columns in a DataFrame to tuples. The extracted fields are returned as new columns, named c0, c1, and so on, as the output shows. The syntax of this function looks like the following:

```
pyspark.sql.functions.json_tuple(col, *fields)
```

The first parameter is the JSON string column name in the DataFrame and the remaining parameters are the field names to extract. If you need to extract complex JSON documents like JSON arrays, you can follow this article: PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame.

Output:

```
StructType([StructField('id', LongType(), True), StructField('c0', StringType(), True), StructField('c1', StringType(), True), StructField('c2', StringType(), True)])
+---+---+------+----------+
| id| c0|    c1|        c2|
+---+---+------+----------+
|  1|  1|10.201|2021-01-01|
|  2|  2|20.201|2022-01-01|
+---+---+------+----------+
```
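To make the shape of that result concrete, here is a plain-Python sketch of the extraction described above. It is not PySpark itself: the helper name `json_tuple_like`, the sample JSON strings and the field names `a`, `b`, `c` are assumptions chosen only to reproduce the values in the output table, and the real function may differ in edge cases.

```python
import json


def json_tuple_like(json_str, *fields):
    """Return one string (or None) per requested top-level field.

    Hypothetical stand-in for json_tuple: every extracted value comes
    back as a string, which is why the schema above is all StringType.
    """
    try:
        doc = json.loads(json_str)
    except (TypeError, ValueError):
        return tuple(None for _ in fields)
    if not isinstance(doc, dict):
        return tuple(None for _ in fields)
    values = []
    for name in fields:
        value = doc.get(name)
        if value is None:
            values.append(None)            # missing field or JSON null
        elif isinstance(value, str):
            values.append(value)           # strings are returned unquoted
        else:
            values.append(json.dumps(value))  # numbers, objects, arrays as JSON text
    return tuple(values)


# Hypothetical input rows: (id, JSON string column).
rows = [
    (1, '{"a": 1, "b": 10.201, "c": "2021-01-01"}'),
    (2, '{"a": 2, "b": 20.201, "c": "2022-01-01"}'),
]

# One output row per input row: id followed by the extracted fields.
extracted = [(row_id,) + json_tuple_like(doc, "a", "b", "c") for row_id, doc in rows]
for row in extracted:
    print(row)
```

Each input row yields exactly one output row, with one new string column per requested field.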

2022-08-16
Code Snippets & Tips

PySpark DataFrame - Extract JSON Value using get_json_object Function

The PySpark SQL function `get_json_object` can be used to extract JSON values from a JSON string column in a Spark DataFrame. This is equivalent to using Spark SQL directly: Spark SQL - Extract Value from JSON String. The syntax of this function looks like the following:

```
pyspark.sql.functions.get_json_object(col, path)
```

The first parameter is the JSON string column name in the DataFrame and the second is the JSON path. This code snippet shows you how to extract JSON values using a JSON path. If you need to extract complex JSON documents like JSON arrays, you can follow this article: PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame.

Output:

```
StructType([StructField('id', LongType(), True), StructField('json_col', StringType(), True), StructField('ATTR_INT_0', StringType(), True), StructField('ATTR_DATE_1', StringType(), True)])
+---+--------------------+----------+-----------+
| id|            json_col|ATTR_INT_0|ATTR_DATE_1|
+---+--------------------+----------+-----------+
|  1|[{"Attr_INT":1, "...|         1| 2022-01-01|
+---+--------------------+----------+-----------+
```
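The path lookup described above can be imitated in plain Python to show how a JSON path walks into the document. This is a sketch, not the Spark implementation: the helper `get_json_object_like` is hypothetical, it supports only `$`, `.name` and `[n]` steps, and the sample JSON string and the two paths are assumptions, because the listing truncates the real column value.

```python
import json
import re

# One path step: either ".name" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object_like(json_str, path):
    """Resolve a small subset of JSON path syntax; return a string or None."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_str)
    except (TypeError, ValueError):
        return None
    pos = 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            return None  # syntax this sketch does not support
        key, index = match.group(1), match.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            i = int(index)
            if not isinstance(node, list) or i >= len(node):
                return None
            node = node[i]
        pos = match.end()
    if node is None:
        return None
    return node if isinstance(node, str) else json.dumps(node)


# Hypothetical document: an array of objects, as the truncated column suggests.
json_col = '[{"Attr_INT": 1}, {"Attr_DATE": "2022-01-01"}]'

attr_int_0 = get_json_object_like(json_col, "$[0].Attr_INT")
attr_date_1 = get_json_object_like(json_col, "$[1].Attr_DATE")
print(attr_int_0, attr_date_1)
```

As in the output table, both extracted values are strings, even when the underlying JSON value is a number.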

2022-08-16
Code Snippets & Tips

Replace Values via regexp_replace Function in PySpark DataFrame

The PySpark SQL API provides the built-in function `regexp_replace` to replace string values that match a specified regular expression. It takes three parameters: the input column of the DataFrame, the regular expression, and the replacement for matches.

```
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
```

Output

The following is the output from this code snippet:

```
+--------------+-------+----------------+
|       str_col|int_col|str_col_replaced|
+--------------+-------+----------------+
|Hello Kontext!|    100|  Hello kontext!|
|Hello Context!|    100|  Hello kontext!|
+--------------+-------+----------------+
```

Every uppercase 'K' or 'C' is replaced with a lowercase 'k'.
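The same replacement can be tried outside Spark with Python's `re` module, which is a quick way to check a pattern before using it on a DataFrame. This is only an approximation: the pattern `[KC]` is an assumption inferred from the output (the original snippet is not shown here), and Spark uses Java regular expressions, whose syntax can differ from Python's for more advanced patterns and replacement strings.

```python
import re

# Rows mirroring the str_col and int_col values in the output table.
rows = [("Hello Kontext!", 100), ("Hello Context!", 100)]

# Assumed pattern: a character class matching an uppercase K or C.
pattern = r"[KC]"

# Append the replaced string as a third field, like str_col_replaced.
replaced = [(text, number, re.sub(pattern, "k", text)) for text, number in rows]

for row in replaced:
    print(row)
```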

2022-08-16
Code Snippets & Tips

Spark SQL - window Function

Spark SQL has a built-in function `window` that bucketizes rows into one or more time windows, given a column that specifies the timestamp. The syntax of the function looks like the following:

`window(timeColumn: ColumnOrName, windowDuration: str, slideDuration: Optional[str] = None, startTime: Optional[str] = None)`

This function is available from Spark 2.0.0. `slideDuration` must be less than or equal to `windowDuration`.

Note: these SQL statements can also be used directly in PySpark via the `spark.sql` function.

This code snippet prints out the following outputs:

Query 1:

```
2022-08-01 12:01:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:15:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01 {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
```

The first two rows are in the same window [12:00, 12:30).

Query 2:

```
2022-08-01 12:01:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:01:00 {"start":2022-08-01 11:45:00,"end":2022-08-01 12:15:00}
2022-08-01 12:15:00 {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:15:00 {"start":2022-08-01 12:00:00,"end":2022-08-01 12:30:00}
2022-08-01 12:31:01 {"start":2022-08-01 12:30:00,"end":2022-08-01 13:00:00}
2022-08-01 12:31:01 {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
```
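The bucketing arithmetic behind these outputs can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: the helper `windows_for` is hypothetical, timestamps are naive and treated as UTC, and the 30-minute window and 15-minute slide are inferred from the listed outputs because the queries themselves are not shown.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumption: naive timestamps are treated as UTC


def windows_for(ts, window, slide=None, start_offset=timedelta(0)):
    """Return every (start, end) window containing ts, newest start first.

    Windows are half-open, [start, end), and their starts are aligned to
    multiples of `slide` counted from the epoch plus `start_offset`.
    """
    if slide is None:
        slide = window  # no slide means tumbling (non-overlapping) windows
    if slide > window:
        raise ValueError("slideDuration must be <= windowDuration")
    # Latest aligned window start that is not after ts.
    last_start = ts - ((ts - EPOCH - start_offset) % slide)
    found = []
    start = last_start
    while start + window > ts:  # this window still covers ts
        found.append((start, start + window))
        start -= slide
    return found


half_hour = timedelta(minutes=30)
quarter_hour = timedelta(minutes=15)
stamps = [
    datetime(2022, 8, 1, 12, 1, 0),
    datetime(2022, 8, 1, 12, 15, 0),
    datetime(2022, 8, 1, 12, 31, 1),
]

tumbling = [windows_for(t, half_hour) for t in stamps]               # like Query 1
sliding = [windows_for(t, half_hour, quarter_hour) for t in stamps]  # like Query 2

for t, wins in zip(stamps, sliding):
    for start, end in wins:
        print(t, start, end)
```

With a slide shorter than the window, each timestamp falls into more than one window, which is why Query 2 returns two rows per input row.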

2022-08-16
Code Snippets & Tips

Spark SQL - session_window Function

Spark SQL has a built-in function `session_window` that creates a window column based on a timestamp column and a gap duration. The syntax of the function looks like the following:

`session_window(timeColumn: ColumnOrName, gapDuration: Union[pyspark.sql.column.Column, str])`

This function is available from Spark 3.2.0.

Note: these SQL statements can also be used directly in PySpark via the `spark.sql` function.

This code snippet prints out the following output:

```
2022-08-01 12:01:00 {"start":2022-08-01 12:01:00,"end":2022-08-01 12:31:00}
2022-08-01 12:15:00 {"start":2022-08-01 12:15:00,"end":2022-08-01 12:45:00}
2022-08-01 12:31:01 {"start":2022-08-01 12:31:01,"end":2022-08-01 13:01:01}
```
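A plain-Python sketch can show where these start and end values come from. It is an illustration only: the helpers `row_sessions` and `merge_sessions` are hypothetical, the 30-minute gap is inferred from the output above, and the merging step is an assumption about what happens when the session column is used as a grouping key, with the exact boundary handling not taken from Spark.

```python
from datetime import datetime, timedelta


def row_sessions(stamps, gap):
    """Per-row session: each timestamp opens the range [ts, ts + gap)."""
    return [(ts, ts + gap) for ts in stamps]


def merge_sessions(sessions):
    """Union overlapping per-row ranges into sessions (assumed grouping behaviour)."""
    merged = []
    for start, end in sorted(sessions):
        if merged and start < merged[-1][1]:
            # The event arrived before the current session expired: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


gap = timedelta(minutes=30)  # inferred from end - start in the output
stamps = [
    datetime(2022, 8, 1, 12, 1, 0),
    datetime(2022, 8, 1, 12, 15, 0),
    datetime(2022, 8, 1, 12, 31, 1),
]

per_row = row_sessions(stamps, gap)
grouped = merge_sessions(per_row)

for start, end in per_row:
    print(start, end)
print(grouped)
```

The per-row ranges match the output listed above. Because each event arrives before the previous range ends, the merging step collapses all three into a single session in this sketch.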

2022-08-16
Code Snippets & Tips

Spark SQL - Left and Right Padding (lpad and rpad) Functions

2022-07-09
Code Snippets & Tips

Spark SQL - Check if String Contains a String

2022-07-09
Code Snippets & Tips

Spark SQL - isnull and isnotnull Functions

2022-07-09
Code Snippets & Tips

Spark SQL - Concatenate w/o Separator (concat_ws and concat)

2022-07-09
Code Snippets & Tips

Spark SQL - Create Map from Arrays via map_from_arrays Function

2022-07-09
Code Snippets & Tips

Spark Hash Functions Introduction - MD5 and SHA

2022-06-16
Spark & PySpark

Spark SQL - Get Next Monday, Tuesday, Wednesday, Thursday, etc.

2022-06-16
Code Snippets & Tips

Spark SQL - Make Date, Timestamp and Intervals

2022-06-16
Code Snippets & Tips

Spark SQL - Get Current Timezone

2022-06-16
Code Snippets & Tips

Spark SQL - Date and Timestamp Truncate Functions

2022-06-15
Code Snippets & Tips

Spark SQL - Extract Day, Month, Year and other Part from Date or Timestamp

2022-06-15
Code Snippets & Tips

Spark SQL - Add Day, Month and Year to Date

2022-06-14
Code Snippets & Tips

Spark SQL - Return JSON Array Length (json_array_length)

2022-06-05
Code Snippets & Tips

Spark SQL - Return JSON Object Keys (json_object_keys)

2022-06-05
Code Snippets & Tips

Spark SQL - Conversion between UTC and Timestamp with Time Zone

2022-06-04
Code Snippets & Tips

Spark SQL - Date/Timestamp Conversion from/to UNIX Date/Timestamp

2022-06-04
Code Snippets & Tips

Spark SQL - Convert Date/Timestamp to String via date_format Function

2022-06-04
Code Snippets & Tips

Spark SQL - Convert Delimited String to Map using str_to_map Function

2022-06-04
Code Snippets & Tips

Spark SQL - element_at Function

2022-06-04
Code Snippets & Tips

Spark SQL - flatten Function

2022-05-31
Code Snippets & Tips

Spark SQL - PERCENT_RANK Window Function

2021-10-18
Spark & PySpark

Spark SQL - Date Difference in Seconds, Minutes, Hours

2021-10-12
Spark & PySpark

Spark "ROW_ID"

2021-05-16
Code Snippets & Tips

Spark SQL - PIVOT Clause

2021-01-10
Spark & PySpark

Spark SQL - Calculate Covariance

2021-01-10
Code Snippets & Tips

Spark SQL - Standard Deviation Calculation

2021-01-10
Code Snippets & Tips

Spark SQL - FIRST_VALUE or LAST_VALUE

2021-01-10
Code Snippets & Tips

Spark SQL - Array Functions

2021-01-10
Spark & PySpark

Spark SQL - Map Functions

2021-01-09
Spark & PySpark

Spark SQL - Convert Object to JSON String

2021-01-09
Code Snippets & Tips

Spark SQL - Extract Value from JSON String

2021-01-09
Code Snippets & Tips

Spark SQL - Convert JSON String to Map

2021-01-09
Spark & PySpark

Spark SQL - Convert String to Timestamp

2021-01-09
Spark & PySpark

Spark SQL - UNIX timestamp functions

2021-01-09
Spark & PySpark

Spark SQL - Date and Timestamp Function

2021-01-09
Spark & PySpark

Spark SQL - LEAD Window Function

2021-01-06
Spark & PySpark

Spark SQL - LAG Window Function

2021-01-06
Spark & PySpark

Spark SQL - NTILE Window Function

2021-01-06
Spark & PySpark

Spark SQL - DENSE_RANK Window Function

2021-01-06
Spark & PySpark

Spark SQL - RANK Window Function

2021-01-03
Spark & PySpark

Spark SQL - ROW_NUMBER Window Functions

2020-12-31
Spark & PySpark