pyspark
115 items tagged with "pyspark"
Articles
PySpark split and explode example
This code snippet shows you how to define a function that splits a string column into an array of strings using the Python built-in split function. It then explodes the array elements into separate rows using the PySpark built-in explode function. Sample output:

```
+----------+-----------------+--------------------+-----+
|  category|            users|         users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B|      user3,user4|      [user3, user4]|user3|
|Category B|      user3,user4|      [user3, user4]|user4|
+----------+-----------------+--------------------+-----+
```
Use sort() and orderBy() with PySpark DataFrame
In Spark DataFrame, two APIs are provided to sort the rows in a DataFrame based on the provided column or columns: sort and orderBy. orderBy is just an alias for the sort API.

Syntax:

```
DataFrame.sort(*cols, **kwargs)
```

For *cols, we can specify a column name, a Column object (pyspark.sql.Column), or a list of column names or Column objects. For **kwargs, we can specify additional arguments; for PySpark, a parameter named ascending is supported. By default its value is True. It can also be a list of boolean values, one for each column used to sort the records. The code snippet provides examples of sorting a DataFrame. Sample outputs:

```
+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  8|   E|
|  6|   E|
|  4|   E|
|  2|   E|
|  9|   O|
|  7|   O|
|  5|   O|
|  3|   O|
|  1|   O|
+---+----+
```
Time Travel with Delta Table in PySpark
Delta Lake provides time travel functionality to retrieve data at a certain point in time or at a certain version. This can be done easily using the following two options when reading from a delta table as a DataFrame: versionAsOf - an integer value to specify a version; timestampAsOf - a timestamp or date string. This code snippet shows you how to use them in Spark DataFrameReader APIs. It includes three examples: query data as of version 1; query data as of 30 days ago (using a computed value via Spark SQL); query data as of a certain timestamp ('2022-09-01 12:00:00.999999UTC 10:00'). You may encounter issues if the timestamp is earlier than the earliest commit: pyspark.sql.utils.AnalysisException: The provided timestamp (2022-08-03 00:00:00.0) is before the earliest version available to this table (2022-08-27 10:53:18.213). Please use a timestamp after 2022-08-27 10:53:18. Similarly, if the provided timestamp is later than the last commit, you may encounter another issue like the following: pyspark.sql.utils.AnalysisException: The provided timestamp: 2022-09-07 00:00:00.0 is after the latest commit timestamp of 2022-08-27 11:30:47.185. If you wish to query this version of the table, please either provide the version with "VERSION AS OF 1" or use the exact timestamp of the last commit: "TIMESTAMP AS OF '2022-08-27 11:30:47'". References Delta Lake with PySpark Walkthrough
Use expr() Function in PySpark DataFrame
Spark SQL function expr() can be used to evaluate a SQL expression and return it as a column (pyspark.sql.column.Column). Any operators or functions that can be used in Spark SQL can also be used with DataFrame operations. This code snippet provides an example of using the expr() function directly with a DataFrame. It also includes a snippet that derives the same column without using this function. *Note: the code snippet assumes a SparkSession object already exists as 'spark'. Output:

```
+---+-----+-----+
| id|id_v1|id_v2|
+---+-----+-----+
|  1|   11|   11|
|  2|   12|   12|
|  3|   13|   13|
|  4|   14|   14|
|  5|   15|   15|
|  6|   16|   16|
|  7|   17|   17|
|  8|   18|   18|
|  9|   19|   19|
+---+-----+-----+
```
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark
PySpark DataFrame - Add or Subtract Milliseconds from Timestamp Column
This code snippet shows you how to add or subtract milliseconds (or microseconds) and seconds from a timestamp column in Spark DataFrame. It first creates a DataFrame in memory and then adds and subtracts milliseconds/seconds from the timestamp column ts using Spark SQL internals. Output:

```
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|id |ts                        |ts1                       |ts2                       |ts3                       |
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|1  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|2  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|3  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|4  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
+---+--------------------------+--------------------------+--------------------------+--------------------------+
```

*Note: the code assumes a SparkSession object already exists via the variable name spark.
Streaming from Kafka to Delta Lake Table via PySpark
Delta Lake with PySpark Walkthrough
Use when() and otherwise() with PySpark DataFrame
In Spark SQL, the CASE WHEN clause can be used to evaluate a list of conditions and return one of multiple possible results for each row. The same can be implemented directly using the pyspark.sql.functions.when and pyspark.sql.Column.otherwise functions. If otherwise is not used together with when, None will be returned for unmatched conditions. Output:

```
+---+------+
| id|id_new|
+---+------+
|  1|     1|
|  2|   200|
|  3|  3000|
|  4|   400|
|  5|     5|
|  6|   600|
|  7|     7|
|  8|   800|
|  9|  9000|
+---+------+
```
PySpark partitionBy with Examples
Spark Bucketing and Bucket Pruning Explained
PySpark - Save DataFrame into Hive Table using insertInto
This code snippet provides one example of inserting data into a Hive table using the PySpark DataFrameWriter.insertInto API.

```
DataFrameWriter.insertInto(tableName: str, overwrite: Optional[bool] = None)
```

It takes two parameters: tableName - the table to insert data into; overwrite - whether to overwrite existing data. By default, it won't overwrite existing data. This function uses position-based resolution for columns instead of column names.
Introduction to Hive Bucketed Table
Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough
PySpark DataFrame - inner, left and right Joins
This code snippet shows you how to perform inner, left and right joins with the DataFrame.join API.

```
def join(self, other, on=None, how=None)
```

Supported join types: the default join type is inner. The supported values for parameter how are: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti. To learn about these different join types, refer to article Spark SQL Joins with Examples.

Join via multiple columns: if there is more than one column to join on, we can specify the on parameter as a list of column names:

```
df1.join(df2, on=['id','other_column'], how='left')
```

Output from the code snippet:

```
+---+----+
| id|attr|
+---+----+
|  1|   A|
|  2|   B|
+---+----+

+---+--------+
| id|attr_int|
+---+--------+
|  1|     100|
|  2|     200|
|  3|     300|
+---+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
+---+----+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
+---+----+--------+

+---+----+--------+
| id|attr|attr_int|
+---+----+--------+
|  1|   A|     100|
|  2|   B|     200|
|  3|null|     300|
+---+----+--------+
```
Spark cache() and persist() Differences
Spark Join Strategy Hints for SQL Queries
PySpark - Read Parquet Files in S3
This code snippet provides an example of reading parquet files located in S3 buckets on AWS (Amazon Web Services). The bucket used is from the New York City taxi trip record data. The S3 bucket location is: s3a://ursa-labs-taxi-data/2009/01/data.parquet. To run the script, we need to set up the package dependency on the Hadoop AWS package, for example, org.apache.hadoop:hadoop-aws:3.3.0. This can be easily done by passing a configuration argument using spark-submit:

```
spark-submit --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.0
```

This can also be done via SparkConf:

```
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')
```

Use temporary AWS credentials: in this code snippet, the AWS AnonymousAWSCredentialsProvider is used. If the bucket is not public, we can also use TemporaryAWSCredentialsProvider.

```
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
         'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', '<access key>')
conf.set('spark.hadoop.fs.s3a.secret.key', '<secret key>')
conf.set('spark.hadoop.fs.s3a.session.token', '<session token>')
```

If you have used AWS CLI or SAML tools to cache local credentials (~/.aws/credentials), you then don't need to specify the access keys, assuming the cached credential has access to the S3 bucket you are reading data from.
Spark spark.sql.files.maxPartitionBytes Explained in Detail
Differences between spark.sql.shuffle.partitions and spark.default.parallelism
Iterate through PySpark DataFrame Rows via foreach
DataFrame.foreach can be used to iterate/loop through each row (pyspark.sql.types.Row) in a Spark DataFrame object and apply a function to each row. This method is a shorthand for DataFrame.rdd.foreach. Note: please be cautious when using this method, especially if your DataFrame is big. Output:

```
+-----+--------+
| col1|    col2|
+-----+--------+
|Hello| Kontext|
|Hello|Big Data|
+-----+--------+

col1=Hello, col2=Kontext
col1=Hello, col2=Big Data
```
Concatenate Columns in Spark DataFrame
This code snippet provides one example of concatenating columns using a separator in Spark DataFrame. The function concat_ws is used directly. For the Spark SQL version, refer to Spark SQL - Concatenate w/o Separator (concat_ws and concat).

Syntax of concat_ws:

```
pyspark.sql.functions.concat_ws(sep: str, *cols: ColumnOrName)
```

Output:

```
+-----+--------+--------------+
| col1|    col2|     col1_col2|
+-----+--------+--------------+
|Hello| Kontext| Hello,Kontext|
|Hello|Big Data|Hello,Big Data|
+-----+--------+--------------+
```
Remove Special Characters from Column in PySpark DataFrame
Spark SQL function regexp_replace can be used to remove special characters from a string column in Spark DataFrame. Depending on the definition of special characters, the regular expressions can vary. For instance, [^0-9a-zA-Z_\-]+ can be used to match characters that are not alphanumeric, hyphens (-) or underscores (_); the regular expression '[@\+\#\$\%\^\!]+' can match these defined special characters. This code snippet replaces special characters with an empty string. Output:

```
+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+
```
PySpark DataFrame - Calculate Distinct Count of Column(s)
This code snippet provides an example of calculating the distinct count of values in a PySpark DataFrame using the countDistinct PySpark SQL function. Output:

```
+---+-----+
| ID|Value|
+---+-----+
|101|   56|
|101|   67|
|102|   70|
|103|   93|
|104|   70|
+---+-----+

+-----------------+------------------+
|DistinctCountOfID|DistinctCountOfRow|
+-----------------+------------------+
|                4|                 5|
+-----------------+------------------+
```
PySpark DataFrame - Calculate sum and avg with groupBy
This code snippet provides an example of calculating aggregated values after grouping data in a PySpark DataFrame. To group data, DataFrame.groupby or DataFrame.groupBy can be used; then the GroupedData.agg method can be used to aggregate data for each group. Built-in aggregation functions like sum, avg, max, min and others can be used. Customized aggregation functions can also be used. Output:

```
+----------+--------+
|TotalScore|AvgScore|
+----------+--------+
|       392|    78.4|
+----------+--------+
```
PySpark DataFrame - percent_rank() Function
In Spark SQL, the PERCENT_RANK function calculates percentile ranking (see Spark SQL - PERCENT_RANK Window Function). This code snippet implements percentile ranking (relative ranking) directly using the PySpark DataFrame percent_rank API instead of Spark SQL. Output:

```
+-------+-----+------------------+
|Student|Score|      percent_rank|
+-------+-----+------------------+
|    101|   56|               0.0|
|    109|   66|0.1111111111111111|
|    103|   70|0.2222222222222222|
|    110|   73|0.3333333333333333|
|    107|   75|0.4444444444444444|
|    102|   78|0.5555555555555556|
|    108|   81|0.6666666666666666|
|    104|   93|0.7777777777777778|
|    105|   95|0.8888888888888888|
|    106|   95|0.8888888888888888|
+-------+-----+------------------+
```
PySpark DataFrame - rank() and dense_rank() Functions
In Spark SQL, the rank and dense_rank functions can be used to rank the rows within a window partition, via RANK (Spark SQL - RANK Window Function) and DENSE_RANK (Spark SQL - DENSE_RANK Window Function). This code snippet implements ranking directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the TXN_DT attribute and sorts the records in each partition by the AMT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+----+----------+
|ACCT|   AMT|             TXN_DT|rank|dense_rank|
+----+------+-------------------+----+----------+
| 101|102.01|2021-01-01 00:00:00|   1|         1|
| 102|  93.0|2021-01-01 00:00:00|   2|         2|
| 101| 10.01|2021-01-01 00:00:00|   3|         3|
| 103| 913.1|2021-01-02 00:00:00|   1|         1|
| 101|900.56|2021-01-03 00:00:00|   1|         1|
| 102|900.56|2021-01-03 00:00:00|   1|         1|
| 103|  80.0|2021-01-03 00:00:00|   3|         2|
+----+------+-------------------+----+----------+
```

As printed out, the difference between dense_rank and rank is that the former will not generate any gaps if the ranked values are the same for multiple rows.
PySpark DataFrame - Add Row Number via row_number() Function
In Spark SQL, row_number can be used to generate a series of sequential numbers starting from 1 for each record in the specified window. Examples can be found in this page: Spark SQL - ROW_NUMBER Window Functions. This code snippet provides the same approach to implement row_number directly using PySpark DataFrame APIs instead of Spark SQL. It creates a window that partitions the data by the ACCT attribute and sorts the records in each partition by the TXN_DT column in descending order. The frame boundary of the window is defined as unbounded preceding and current row. Output:

```
+----+------+-------------------+
|ACCT|   AMT|             TXN_DT|
+----+------+-------------------+
| 101| 10.01|2021-01-01 00:00:00|
| 101|102.01|2021-01-01 00:00:00|
| 102|  93.0|2021-01-01 00:00:00|
| 103| 913.1|2021-01-02 00:00:00|
| 101|900.56|2021-01-03 00:00:00|
+----+------+-------------------+

+----+------+-------------------+------+
|ACCT|   AMT|             TXN_DT|rownum|
+----+------+-------------------+------+
| 101|900.56|2021-01-03 00:00:00|     1|
| 101| 10.01|2021-01-01 00:00:00|     2|
| 101|102.01|2021-01-01 00:00:00|     3|
| 102|  93.0|2021-01-01 00:00:00|     1|
| 103| 913.1|2021-01-02 00:00:00|     1|
+----+------+-------------------+------+
```
Introduction to PySpark ArrayType and MapType
PySpark DataFrame Fill Null Values with fillna or na.fill Functions
In PySpark, DataFrame.fillna, DataFrame.na.fill and DataFrameNaFunctions.fill are aliases of each other. We can use them to fill null values with a constant value. For example, replace all null integer columns with value 0, etc. Output:

```
+--------------+-------+--------+
|       str_col|int_col|bool_col|
+--------------+-------+--------+
|Hello Kontext!|    100|    true|
|Hello Context!|      0|    null|
+--------------+-------+--------+

+--------------+-------+--------+
|       str_col|int_col|bool_col|
+--------------+-------+--------+
|Hello Kontext!|    100|    true|
|Hello Context!|   null|   false|
+--------------+-------+--------+

+--------------+-------+--------+
|       str_col|int_col|bool_col|
+--------------+-------+--------+
|Hello Kontext!|    100|    true|
|Hello Context!|      0|   false|
+--------------+-------+--------+
```
PySpark User Defined Functions (UDF)
User defined functions (UDF) in PySpark can be used to extend the built-in function library to provide extra functionality, for example, creating a function to extract values from XML, etc. This code snippet shows you how to implement a UDF in PySpark. It shows two slightly different approaches - one uses the udf decorator and the other does not. Output:

```
+--------------+-------+--------+--------+
|       str_col|int_col|str_len1|str_len2|
+--------------+-------+--------+--------+
|Hello Kontext!|    100|      14|      14|
|Hello Context!|    100|      14|      14|
+--------------+-------+--------+--------+
```
Introduction to PySpark StructType and StructField
Spark Insert Data into Hive Tables
PySpark DataFrame - Convert JSON Column to Row using json_tuple
PySpark SQL function json_tuple can be used to convert DataFrame JSON string columns to tuples (new rows in the DataFrame). The syntax of this function looks like the following:

```
pyspark.sql.functions.json_tuple(col, *fields)
```

The first parameter is the JSON string column name in the DataFrame and the second is the field name list to extract. If you need to extract complex JSON documents like JSON arrays, you can follow this article - PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame. Output:

```
StructType([StructField('id', LongType(), True), StructField('c0', StringType(), True), StructField('c1', StringType(), True), StructField('c2', StringType(), True)])

+---+---+------+----------+
| id| c0|    c1|        c2|
+---+---+------+----------+
|  1|  1|10.201|2021-01-01|
|  2|  2|20.201|2022-01-01|
+---+---+------+----------+
```
PySpark DataFrame - Extract JSON Value using get_json_object Function
PySpark SQL function get_json_object can be used to extract JSON values from a JSON string column in Spark DataFrame. This is equivalent to using Spark SQL directly: Spark SQL - Extract Value from JSON String. The syntax of this function looks like the following:

```
pyspark.sql.functions.get_json_object(col, path)
```

The first parameter is the JSON string column name in the DataFrame and the second is the JSON path. This code snippet shows you how to extract JSON values using JSON path. If you need to extract complex JSON documents like JSON arrays, you can follow this article - PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame. Output:

```
StructType([StructField('id', LongType(), True), StructField('json_col', StringType(), True), StructField('ATTR_INT_0', StringType(), True), StructField('ATTR_DATE_1', StringType(), True)])

+---+--------------------+----------+-----------+
| id|            json_col|ATTR_INT_0|ATTR_DATE_1|
+---+--------------------+----------+-----------+
|  1|[{"Attr_INT":1, "...|         1| 2022-01-01|
+---+--------------------+----------+-----------+
```
Replace Values via regexp_replace Function in PySpark DataFrame
PySpark SQL APIs provide the regexp_replace built-in function to replace string values that match the specified regular expression. It takes three parameters: the input column of the DataFrame, the regular expression and the replacement for matches.

```
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
```

The following is the output from this code snippet:

```
+--------------+-------+----------------+
|       str_col|int_col|str_col_replaced|
+--------------+-------+----------------+
|Hello Kontext!|    100|  Hello kontext!|
|Hello Context!|    100|  Hello kontext!|
+--------------+-------+----------------+
```

All uppercase 'K' or 'C' are replaced with lowercase 'k'.
PySpark DataFrame - union, unionAll and unionByName
PySpark DataFrame provides three methods to union data together: union, unionAll and unionByName. The first two are like the Spark SQL UNION ALL clause, which doesn't remove duplicates; unionAll is the alias for union, and we can use the distinct method to deduplicate. The third method uses column names instead of positions to resolve columns. unionByName can also be used to merge two DataFrames with different schemas.

Syntax of unionByName:

```
DataFrame.unionByName(other, allowMissingColumns=False)
```

If allowMissingColumns is specified as True, the columns missing from either DataFrame will be added with default value null. This parameter is only available from Spark 3.1.0; for lower versions, the following error may appear:

```
df1.unionByName(df4, allowMissingColumns=True).show(truncate=False)
TypeError: unionByName() got an unexpected keyword argument 'allowMissingColumns'
```

The following are the outputs from the code snippet.

```
+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1  |A  |100|
|2  |B  |200|
+---+---+---+

+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1  |A  |100|
|1  |A  |100|
|2  |B  |200|
+---+---+---+

+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1  |A  |100|
|2  |B  |200|
+---+---+---+

+---+---+----+----+
|c1 |c2 |c3  |c4  |
+---+---+----+----+
|1  |A  |100 |null|
|3  |C  |null|ABC |
+---+---+----+----+
```
PySpark DataFrame - drop and dropDuplicates
PySpark DataFrame APIs provide two drop-related methods: drop and dropDuplicates (or drop_duplicates). The former is used to drop specified column(s) from a DataFrame while the latter is used to drop duplicated rows. This code snippet utilizes these two functions. Outputs:

```
+----+------+
|ACCT|AMT   |
+----+------+
|101 |10.01 |
|101 |10.01 |
|101 |102.01|
+----+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+

+----+----------+------+
|ACCT|TXN_DT    |AMT   |
+----+----------+------+
|101 |2021-01-01|102.01|
|101 |2021-01-01|10.01 |
+----+----------+------+
```
PySpark DataFrame - Select Columns using select Function
In PySpark, we can use the select function to select a subset of or all columns from a DataFrame.

Syntax:

```
DataFrame.select(*cols)
```

This function returns a new DataFrame object based on the projection expression list. This code snippet prints out the following output:

```
+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+
```
PySpark DataFrame - Expand or Explode Nested StructType
PySpark DataFrame - Filter Records using where and filter Function
PySpark DataFrame - Add Column using withColumn
PySpark DataFrame - explode Array and Map Columns
Extract Value from XML Column in PySpark DataFrame
PySpark - Flatten (Explode) Nested StructType Column
PySpark - Read and Parse Apache Access Log Text Files
PySpark - Read from Hive Tables
PySpark - Read and Write JSON
Fix - TypeError: an integer is required (got type bytes)
Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes
PySpark - Read Data from MariaDB Database
PySpark - Read Data from Oracle Database
Spark Schema Merge (Evolution) for Orc Files
PySpark - Read and Write Orc Files
PySpark - Read and Write Avro Files
Spark Hash Functions Introduction - MD5 and SHA
Spark submit --num-executors --executor-cores --executor-memory
Spark repartition Function Internals
Create Spark Indexes via Hyperspace
Read Parquet Files from Nested Directories
Spark Read JSON Lines (.jsonl) File
Spark - Read and Write Data with MongoDB
Spark - Save DataFrame to Hive Table
Spark (PySpark) - Read Data from SQL Server Database
PySpark - Convert Python Array or List to Spark DataFrame
Spark SQL - Date Difference in Seconds, Minutes, Hours
"Delete" Rows (Data) from PySpark DataFrame
Spark SQL - Average (AVG) Calculation
Spark SQL - Group By
Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
Spark - Check if Array Column Contains Specific Value
Resolve: Python in worker has different version 2.7 than that in driver 3.8...
Spark "ROW_ID"
PySpark: Read File in Google Cloud Storage
Spark - Read from BigQuery Table
Deduplicate Spark DataFrame via dropDuplicates() and distinct()
Save Spark DataFrame to Teradata and Resolve Common Errors
Connect to PostgreSQL in Spark (PySpark)
Convert Pandas DataFrame to Spark DataFrame
Connect to MySQL in Spark (PySpark)
Spark Scala: Load Data from Teradata
Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark DataFrame using Python. The functions DataFrame.filter or DataFrame.where can be used to filter out null values.
Change Column Type in PySpark DataFrame
Add Constant Column to PySpark DataFrame
Delete or Remove Columns from PySpark DataFrame
Rename DataFrame Column Names in PySpark
Install Apache Spark 3.0.0 on Windows 10
Load CSV File in PySpark
PySpark Read Multiline (Multiple Lines) from CSV File
Save DataFrame to SQL Databases via JDBC in PySpark
Spark Read from SQL Server Source using Windows/Kerberos Authentication
Schema Merging (Evolution) with Parquet in Spark and Hive
PySpark: Convert Python Dictionary List to Spark DataFrame
Improve PySpark Performance using Pandas UDF with Apache Arrow
Read and Write XML files in PySpark
Convert Python Dictionary List to PySpark DataFrame
Pass Environment Variables to Executors in PySpark
Save DataFrame as CSV File in Spark
Run Multiple Python Scripts PySpark Application with yarn-cluster Mode
Convert PySpark Row List to Pandas Data Frame
Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>
PySpark: Convert Python Array/List to Spark Data Frame
Load Data from Teradata in Spark (PySpark)
Read Hadoop Credential in PySpark
Data Partitioning Functions in Spark (PySpark) Deep Dive
Get the Current Spark Context Settings/Configurations
Read Data from Hive in Spark 1.x and 2.x
Data Partition in Spark (PySpark) In-depth Walkthrough
PySpark - Fix PermissionError: [WinError 5] Access is denied
Spark - Save DataFrame to Hive Table
Connect to SQL Server in Spark (PySpark)
Debug PySpark Code in Visual Studio Code
Implement SCD Type 2 Full Merge via Spark Data Frames
Diagrams
PySpark Reading from S3
This diagram is used as an article feature image; it depicts reading data from an S3 bucket via PySpark.
Spark Partitioning Physical Operators
This diagram shows how Spark decides which repartition physical operator will be used for each scenario.

```
repartition(numPartitions, *cols)
```