Spark & PySpark

Articles

PySpark split and explode example

This code snippet shows how to define a function that splits a string column into an array of strings using Python's built-in split function. It then explodes the resulting array into one row per element using PySpark's built-in explode function.

Sample output:

+----------+-----------------+--------------------+-----+
|  category|            users|         users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B|      user3,user4|      [user3, user4]|user3|
|Category B|      user3,user4|      [user3, user4]|user4|
+----------+-----------------+--------------------+-----+

2023-08-06

SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

2022-09-01

java.lang.NoSuchMethodError: PoolConfig.setMinEvictableIdleTime

2022-08-27

Streaming from Kafka to Delta Lake Table via PySpark

2022-08-26

Delta Lake with PySpark Walkthrough

2022-08-26

PySpark partitionBy with Examples

2022-08-25

Spark Bucketing and Bucket Pruning Explained

2022-08-24

Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough

2022-08-23

Spark cache() and persist() Differences

2022-08-21

Use Spark SQL Partitioning Hints

2022-08-21

Spark Join Strategy Hints for SQL Queries

2022-08-21

Spark spark.sql.files.maxPartitionBytes Explained in Detail

2022-08-21

Differences between spark.sql.shuffle.partitions and spark.default.parallelism

2022-08-20

Introduction to PySpark ArrayType and MapType

2022-08-18

Introduction to PySpark StructType and StructField

2022-08-17

Spark Insert Data into Hive Tables

2022-08-17

Extract Value from XML Column in PySpark DataFrame

2022-07-15

PySpark - Flatten (Explode) Nested StructType Column

2022-07-09

PySpark - Read and Parse Apache Access Log Text Files

2022-07-09

PySpark - Read from Hive Tables

2022-07-08

PySpark - Read and Write JSON

2022-07-04

Spark DEBUG: It is possible the underlying files have been updated.

2022-06-22

Spark Dynamic and Static Partition Overwrite

2022-06-22

Fix - TypeError: an integer is required (got type bytes)

2022-06-19

Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes

2022-06-19

PySpark - Read Data from MariaDB Database

2022-06-18

PySpark - Read Data from Oracle Database

2022-06-18

Spark Schema Merge (Evolution) for Orc Files

2022-06-18

PySpark - Read and Write Orc Files

2022-06-18

PySpark - Read and Write Avro Files

2022-06-18

Spark Hash Functions Introduction - MD5 and SHA

2022-06-16

Install Spark 3.2.1 on Linux or WSL

2022-06-14

Spark SQL - Literals (Constants)

2022-05-31

Spark SQL Joins with Examples

2022-05-31

Spark submit --num-executors --executor-cores --executor-memory

2022-03-29

Spark repartition Function Internals

2022-03-28

Create Spark Indexes via Hyperspace

2021-12-22

Read Parquet Files from Nested Directories

2021-12-22

Spark Read JSON Lines (.jsonl) File

2021-12-21

Spark SQL - PERCENT_RANK Window Function

2021-10-18

Spark - Read and Write Data with MongoDB

2021-10-15

Spark Dataset and DataFrame

2021-10-13

Spark SQL - Date Difference in Seconds, Minutes, Hours

2021-10-12

"Delete" Rows (Data) from PySpark DataFrame

2021-09-25

Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

2021-09-05

Resolve: Python in worker has different version 2.7 than that in driver 3.8...

2021-05-17

PySpark: Read File in Google Cloud Storage

2021-03-21

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set

2021-03-08

Killing Running Applications of Spark

2021-03-08

Spark repartition vs. coalesce

2021-03-07

Add JARs to a Spark Job

2021-02-20

Connect to PostgreSQL in Spark (PySpark)

2021-02-14

Spark 3.0.1: Connect to HBase 2.4.1

2021-02-05

Spark Scala: Load Data from MySQL

2021-01-24

Connect to MySQL in Spark (PySpark)

2021-01-23

Apache Spark 3.0.1 Installation on macOS

2021-01-17

Spark SQL - PIVOT Clause

2021-01-10

Spark SQL - Array Functions

2021-01-10

Spark SQL - Map Functions

2021-01-09

Spark SQL - Convert JSON String to Map

2021-01-09

Spark SQL - Convert String to Timestamp

2021-01-09

Spark SQL - UNIX timestamp functions

2021-01-09

Spark SQL - Date and Timestamp Function

2021-01-09

Spark SQL - LEAD Window Function

2021-01-06

Spark SQL - LAG Window Function

2021-01-06

Spark SQL - NTILE Window Function

2021-01-06

Spark SQL - DENSE_RANK Window Function

2021-01-06

Spark SQL - RANK Window Function

2021-01-03

Spark SQL - ROW_NUMBER Window Functions

2020-12-31

Show Headings (Column Names) in spark-sql CLI Result

2020-12-28

Apache Spark 3.0.1 Installation on Linux or WSL Guide

2020-12-27

Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver

2020-12-27

Spark Scala: Load Data from Teradata

2020-12-19

Spark Scala: Load Data from SQL Server

2020-12-18

Spark Scala: Read XML File as DataFrame

2020-12-16

Scala: Read CSV File as Spark DataFrame

2020-12-16

Scala: Parse JSON String as Spark DataFrame

2020-12-16

Scala: Change Column Type in Spark Data Frame

2020-12-14

Scala: Filter Spark DataFrame Columns with None or Null Values

This article shows how to filter NULL/None values from a Spark data frame using Scala. Either DataFrame.filter or DataFrame.where (an alias of filter) can be used to filter out null values.

2020-12-14

Scala - Add Constant Column to Spark Data Frame

2020-12-14

Scala: Remove Columns from Spark Data Frame

2020-12-13

Scala: Change Data Frame Column Names in Spark

2020-12-13

Scala: Convert List to Spark Data Frame

2020-12-13

Fix - ERROR SparkUI: Failed to bind SparkUI

2020-12-13

Spark SQL - Convert String to Date

2020-10-23

About Configuration spark.sql.optimizer.metadataOnly

2020-10-03

Filter Spark DataFrame Columns with None or Null Values

This article shows how to filter NULL/None values from a Spark data frame using Python. Either DataFrame.filter or DataFrame.where (an alias of filter) can be used to filter out null values.

2020-08-10

Turn off INFO logs in Spark

2020-08-09

Change Column Type in PySpark DataFrame

2020-08-09

Add Constant Column to PySpark DataFrame

2020-08-09

Delete or Remove Columns from PySpark DataFrame

2020-08-09

Rename DataFrame Column Names in PySpark

2020-08-09

Apache Spark 3.0.0 Installation on Linux Guide

2020-08-09

Install Apache Spark 3.0.0 on Windows 10

2020-08-09

Load CSV File in PySpark

2020-08-04

Python: Save Pandas DataFrame to Teradata

2020-05-03

PySpark Read Multiline (Multiple Lines) from CSV File

2020-03-31

Save DataFrame to SQL Databases via JDBC in PySpark

2020-03-20

Spark Read from SQL Server Source using Windows/Kerberos Authentication

2020-02-03

Schema Merging (Evolution) with Parquet in Spark and Hive

2020-02-02

PySpark: Convert Python Dictionary List to Spark DataFrame

2019-12-31

Improve PySpark Performance using Pandas UDF with Apache Arrow

2019-12-29

Convert Python Dictionary List to PySpark DataFrame

2019-12-25

Save DataFrame as CSV File in Spark

2019-12-03

Run Multiple Python Scripts PySpark Application with yarn-cluster Mode

2019-08-25

Convert PySpark Row List to Pandas Data Frame

2019-08-22

Diagnostics: Container is running beyond physical memory limits

2019-07-17

Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>

2019-07-10

PySpark: Convert Python Array/List to Spark Data Frame

2019-07-10

Load Data from Teradata in Spark (PySpark)

2019-07-06

Read Hadoop Credential in PySpark

2019-07-06

Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux

2019-05-19

.NET for Apache Spark Preview with Examples

2019-04-26

Data Partitioning Functions in Spark (PySpark) Deep Dive

2019-04-06

Get the Current Spark Context Settings/Configurations

2019-04-05

Read Data from Hive in Spark 1.x and 2.x

2019-04-04

Data Partition in Spark (PySpark) In-depth Walkthrough

2019-03-30

PySpark - Fix PermissionError: [WinError 5] Access is denied

2019-03-27

Spark - Save DataFrame to Hive Table

2019-03-27

Connect to SQL Server in Spark (PySpark)

2019-03-23

Debug PySpark Code in Visual Studio Code

2019-03-03

Implement SCD Type 2 Full Merge via Spark Data Frames

2019-02-03

PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame

2019-01-05

Write and Read Parquet Files in HDFS through Spark/Scala

2018-03-17

Write and Read Parquet Files in Spark/Scala

2018-03-17

Convert String to Date in Spark (Scala)

2018-03-04

Read Text File from Hadoop in Zeppelin through Spark Context

2018-03-03

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

2018-02-25

Install Spark 2.2.1 in Windows

2018-02-25