Apache Spark installation guides, performance tuning tips, general tutorials, etc.
Spark SQL - PIVOT Clause
Like other SQL engines, Spark also supports the PIVOT clause. PIVOT is typically used to calculate aggregated values for each value in a column, and the calculated values are included as columns in the result set. PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ] FOR ...
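For reference, a minimal PIVOT query could look like the sketch below; the sales table and its year, quarter and amount columns are made up for illustration:

SELECT *
FROM (
  SELECT year, quarter, amount FROM sales
)
PIVOT (
  SUM(amount)
  FOR quarter IN (1 AS Q1, 2 AS Q2, 3 AS Q3, 4 AS Q4)
);
-- one row per year, with the summed amount appearing in columns Q1 .. Q4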
Spark SQL - Array Functions
Unlike traditional relational databases, Spark SQL supports complex types like array or map. There are a number of built-in functions to operate efficiently on array values. ArrayType columns can be created directly using the array or array_repeat function. The latter repeats one element multiple times ...
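For example, in the spark-sql CLI (the literal values here are arbitrary):

spark-sql> SELECT array(1, 2, 3);
[1,2,3]
spark-sql> SELECT array_repeat('a', 3);
["a","a","a"]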
Spark SQL - Map Functions
In Spark SQL, MapType is designed for key-value pairs, which is like the dictionary object type in many other programming languages. This article summarizes the commonly used map functions in Spark SQL. Function map is used to create a map. Example: spark-sql> select ...
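A couple of minimal examples (the keys and values are arbitrary):

spark-sql> SELECT map('a', 1, 'b', 2);
{"a":1,"b":2}
spark-sql> SELECT map_keys(map('a', 1, 'b', 2));
["a","b"]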
Spark SQL - Convert JSON String to Map
Spark SQL function from_json(jsonStr, schema[, options]) returns a struct value parsed from the given JSON string using the given schema. Parameter options is used to control how the JSON is parsed. It accepts the same options as the JSON data source in Spark DataFrame reader APIs. The following code ...
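In line with the article title, a MAP schema string can also be passed to get a MapType value back (a small sketch; the JSON literal is made up):

spark-sql> SELECT from_json('{"a":1,"b":2}', 'MAP<STRING,INT>');
{"a":1,"b":2}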
Spark SQL - Convert String to Timestamp
Similar to Convert String to Date using Spark SQL, you can convert a timestamp string to the Spark SQL timestamp data type. Function to_timestamp(timestamp_str[, fmt]) parses the `timestamp_str` expression with the `fmt` expression to a timestamp data type in Spark. Example ...
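For instance (the timestamp values are arbitrary; the second call supplies an explicit format pattern):

spark-sql> SELECT to_timestamp('2021-01-09 17:34:56');
2021-01-09 17:34:56
spark-sql> SELECT to_timestamp('09/01/2021', 'dd/MM/yyyy');
2021-01-09 00:00:00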
Spark SQL - UNIX timestamp functions
Function unix_timestamp() returns the UNIX timestamp of the current time. You can also specify an input timestamp value. Example: spark-sql> select unix_timestamp(); unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss) 1610174099 spark-sql> select unix_timestamp(current_timestamp ...
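Both directions are available: unix_timestamp parses a string (or the current time) into epoch seconds, while from_unixtime formats epoch seconds back into a string. The results depend on the session time zone, so outputs are omitted here and the epoch value is arbitrary:

spark-sql> SELECT unix_timestamp('2021-01-09 17:34:56', 'yyyy-MM-dd HH:mm:ss');
spark-sql> SELECT from_unixtime(1610174099, 'yyyy-MM-dd HH:mm:ss');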
Spark SQL - Date and Timestamp Function
Function current_date() or current_date can be used to return the current date at the start of query evaluation. Example: spark-sql> select current_date(); current_date() 2021-01-09 spark-sql> select current_date; current_date() 2021-01-09 *Brackets are optional for this ...
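current_timestamp, which appears in the UNIX timestamp example above, behaves the same way: the parentheses are optional and the value is fixed at the start of query evaluation. The output depends on the clock, so it is not shown:

spark-sql> SELECT current_timestamp();
spark-sql> SELECT current_timestamp;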
Spark SQL - LEAD Window Function
Spark LEAD function provides access to a row at a given offset that follows the current row in a window. This analytic function can be used in a SELECT statement to compare values in the current row with values in a following row. This function is similar to the Spark SQL - LAG Window Function.
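A minimal sketch, assuming a hypothetical transactions table with account, txn_date and amount columns:

SELECT account, txn_date, amount,
       LEAD(amount, 1) OVER (PARTITION BY account ORDER BY txn_date) AS next_amount
FROM transactions;
-- next_amount is NULL for the last row of each account, since no default is supplied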
Spark SQL - LAG Window Function
Spark LAG function provides access to a row at a given offset that comes before the current row in the window. This function can be used in a SELECT statement to compare values in the current row with values in a previous row. lag(input[, offset[, default]]) OVER ([PARTITION BY ..] ORDER BY ...) ...
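The LAG version of the same hypothetical query, with an explicit default for rows that have no predecessor:

SELECT account, txn_date, amount,
       LAG(amount, 1, 0) OVER (PARTITION BY account ORDER BY txn_date) AS prev_amount
FROM transactions;
-- prev_amount falls back to 0 for the first row of each account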
Spark SQL - NTILE Window Function
Spark NTILE function divides the rows in each window into 'n' buckets ranging from 1 to at most 'n' (n is the specified parameter). The following sample SQL uses the NTILE function to divide records in each window into two buckets. SELECT TXN.*, NTILE(2) OVER (PARTITION BY ...
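A completed version of that statement could look like the sketch below; the transactions table and its columns are assumptions, not taken from the original article:

SELECT TXN.*,
       NTILE(2) OVER (PARTITION BY account ORDER BY txn_date) AS bucket
FROM transactions TXN;
-- each account's rows are split into bucket 1 and bucket 2 in txn_date order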