Spark & PySpark
Articles
PySpark split and explode example
This code snippet shows you how to define a function that splits a string column into an array of strings using Python's built-in split function, and then explodes the array elements into rows using the PySpark built-in explode function. Sample output:

```
+----------+-----------------+--------------------+-----+
|  category|            users|         users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B|      user3,user4|      [user3, user4]|user3|
|Category B|      user3,user4|      [user3, user4]|user4|
+----------+-----------------+--------------------+-----+
```
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark
java.lang.NoSuchMethodError: PoolConfig.setMinEvictableIdleTime
Streaming from Kafka to Delta Lake Table via PySpark
Delta Lake with PySpark Walkthrough
PySpark partitionBy with Examples
Spark Bucketing and Bucket Pruning Explained
Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough
Spark cache() and persist() Differences
Use Spark SQL Partitioning Hints
Spark Join Strategy Hints for SQL Queries
Spark spark.sql.files.maxPartitionBytes Explained in Detail
Differences between spark.sql.shuffle.partitions and spark.default.parallelism
Introduction to PySpark ArrayType and MapType
Introduction to PySpark StructType and StructField
Spark Insert Data into Hive Tables
Extract Value from XML Column in PySpark DataFrame
PySpark - Flatten (Explode) Nested StructType Column
PySpark - Read and Parse Apache Access Log Text Files
PySpark - Read from Hive Tables
PySpark - Read and Write JSON
Spark DEBUG: It is possible the underlying files have been updated.
Spark Dynamic and Static Partition Overwrite
Fix - TypeError: an integer is required (got type bytes)
Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes
PySpark - Read Data from MariaDB Database
PySpark - Read Data from Oracle Database
Spark Schema Merge (Evolution) for Orc Files
PySpark - Read and Write Orc Files
PySpark - Read and Write Avro Files
Spark Hash Functions Introduction - MD5 and SHA
Install Spark 3.2.1 on Linux or WSL
Spark SQL - Literals (Constants)
Spark SQL Joins with Examples
Spark submit --num-executors --executor-cores --executor-memory
Spark repartition Function Internals
Create Spark Indexes via Hyperspace
Read Parquet Files from Nested Directories
Spark Read JSON Lines (.jsonl) File
Spark SQL - PERCENT_RANK Window Function
Spark - Read and Write Data with MongoDB
Spark Dataset and DataFrame
Spark SQL - Date Difference in Seconds, Minutes, Hours
"Delete" Rows (Data) from PySpark DataFrame
Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
Resolve: Python in worker has different version 2.7 than that in driver 3.8...
PySpark: Read File in Google Cloud Storage
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set
Killing Running Applications of Spark
Spark repartition vs. coalesce
Add JARs to a Spark Job
Connect to PostgreSQL in Spark (PySpark)
Spark 3.0.1: Connect to HBase 2.4.1
Spark Scala: Load Data from MySQL
Connect to MySQL in Spark (PySpark)
Apache Spark 3.0.1 Installation on macOS
Spark SQL - PIVOT Clause
Spark SQL - Array Functions
Spark SQL - Map Functions
Spark SQL - Convert JSON String to Map
Spark SQL - Convert String to Timestamp
Spark SQL - UNIX timestamp functions
Spark SQL - Date and Timestamp Function
Spark SQL - LEAD Window Function
Spark SQL - LAG Window Function
Spark SQL - NTILE Window Function
Spark SQL - DENSE_RANK Window Function
Spark SQL - RANK Window Function
Spark SQL - ROW_NUMBER Window Functions
Show Headings (Column Names) in spark-sql CLI Result
Apache Spark 3.0.1 Installation on Linux or WSL Guide
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Spark Scala: Load Data from Teradata
Spark Scala: Load Data from SQL Server
Spark Scala: Read XML File as DataFrame
Scala: Read CSV File as Spark DataFrame
Scala: Parse JSON String as Spark DataFrame
Scala: Change Column Type in Spark Data Frame
Scala: Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Scala. The DataFrame.filter or DataFrame.where function can be used to filter out null values.
Scala - Add Constant Column to Spark Data Frame
Scala: Remove Columns from Spark Data Frame
Scala: Change Data Frame Column Names in Spark
Scala: Convert List to Spark Data Frame
Fix - ERROR SparkUI: Failed to bind SparkUI
Spark SQL - Convert String to Date
About Configuration spark.sql.optimizer.metadataOnly
Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Python. The DataFrame.filter or DataFrame.where function can be used to filter out null values.