spark
127 items tagged with "spark"
Articles
Get Started with Apache Kylin - OLAP for Big Data
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark
PySpark DataFrame - Add or Subtract Milliseconds from Timestamp Column
This code snippet shows you how to add or subtract milliseconds (or microseconds) and seconds from a timestamp column in a Spark DataFrame. It first creates a DataFrame in memory and then adds and subtracts milliseconds/seconds from the timestamp column ts using Spark SQL interval expressions. Output:

```
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|id |ts                        |ts1                       |ts2                       |ts3                       |
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|1  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|2  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|3  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|4  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
+---+--------------------------+--------------------------+--------------------------+--------------------------+
```

*Note: the code assumes a SparkSession object already exists via the variable name `spark`.*
java.lang.NoSuchMethodError: PoolConfig.setMinEvictableIdleTime
Delta Lake with PySpark Walkthrough
PySpark partitionBy with Examples
Spark Bucketing and Bucket Pruning Explained
Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough
Spark cache() and persist() Differences
Use Spark SQL Partitioning Hints
Start Spark History Server UI
This code snippet provides a simple CLI to start the Spark History Server service. About Spark History Server: the Spark History Server can be used to look up historical Spark jobs that completed successfully or failed. By default, Spark execution logs are saved into local temporary folders. You can add configuration items to spark-defaults.conf to save logs to HDFS. For example, the following configurations ensure the logs are stored in my local Hadoop environment:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://localhost:9000/shared/spark-logs
spark.history.fs.logDirectory    hdfs://localhost:9000/shared/spark-logs
```

In the code snippet, `SPARK_HOME` is the environment variable that points to the location where Spark is installed. If this variable is not defined, you can directly specify the full path to the shell script (sbin/start-history-server.sh). History Server URL: by default, the URL is http://localhost:18080/ in a local environment. You can replace localhost with the address of the server where the history server is started; it is usually hosted on an edge server. By clicking the link of each App, you will be able to find the job details for each Spark application.
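The CLI referenced above is the script shipped in Spark's sbin directory; a minimal sketch, assuming Spark is installed and `SPARK_HOME` is set:

```shell
# Start the Spark History Server (it reads spark.history.* settings
# from $SPARK_HOME/conf/spark-defaults.conf).
$SPARK_HOME/sbin/start-history-server.sh

# Stop it again when finished.
$SPARK_HOME/sbin/stop-history-server.sh
```

Once started, browse to http://localhost:18080/ (or the host where the server runs) to see the application list.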
Spark Join Strategy Hints for SQL Queries
Spark spark.sql.files.maxPartitionBytes Explained in Detail
Differences between spark.sql.shuffle.partitions and spark.default.parallelism
Spark Insert Data into Hive Tables
Spark DEBUG: It is possible the underlying files have been updated.
Spark Dynamic and Static Partition Overwrite
Install Spark 3.3.0 on Linux or WSL
Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes
PySpark - Read and Write Orc Files
PySpark - Read and Write Avro Files
Spark Hash Functions Introduction - MD5 and SHA
Install Spark 3.2.1 on Linux or WSL
Spark SQL - Literals (Constants)
Spark SQL Joins with Examples
AWS EMR Debug - Container release on a *lost* node
Spark submit --num-executors --executor-cores --executor-memory
Spark repartition Function Internals
Create Spark Indexes via Hyperspace
Read Parquet Files from Nested Directories
Spark Read JSON Lines (.jsonl) File
Spark - Read and Write Data with MongoDB
Spark - Save DataFrame as Hive Table
Spark (PySpark) - Read Data from SQL Server Database
PySpark: Convert JSON String Column in DataFrame to Array
PySpark - Convert Python Array or List to Spark DataFrame
Spark Dataset and DataFrame
"Delete" Rows (Data) from PySpark DataFrame
Spark - Check if Array Column Contains Specific Value
PySpark: Read File in Google Cloud Storage
Spark - Read from BigQuery Table
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set
Killing Running Applications of Spark
Spark DataFrame: Show Full Column Content without Truncation
Save Spark DataFrame to Teradata and Resolve Common Errors
Spark repartition vs. coalesce
Add JARs to a Spark Job
Connect to PostgreSQL in Spark (PySpark)
Spark 3.0.1: Connect to HBase 2.4.1
Spark Scala: Load Data from MySQL
Connect to MySQL in Spark (PySpark)
Apache Spark 3.0.1 Installation on macOS
Show Headings (Column Names) in spark-sql CLI Result
Apache Spark 3.0.1 Installation on Linux or WSL Guide
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Get Started on .NET 5 with Apache Spark
.NET for Apache Spark 1.0 was officially released on 14th Oct 2020. This version was released together with .NET Core 3.0. Since .NET for Apache Spark is written against .NET Standard, it should work with .NET 5 too. This article shows how to use .NET 5 with Apache Spark.
Spark Scala: Load Data from Teradata
Spark Scala: Load Data from SQL Server
Spark Scala: Read XML File as DataFrame
Scala: Read JSON file as Spark DataFrame
Scala: Read CSV File as Spark DataFrame
Scala: Parse JSON String as Spark DataFrame
Scala: Change Column Type in Spark Data Frame
Scala: Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Scala. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
Scala - Add Constant Column to Spark Data Frame
Scala: Remove Columns from Spark Data Frame
Scala: Change Data Frame Column Names in Spark
Scala: Convert List to Spark Data Frame
Fix - ERROR SparkUI: Failed to bind SparkUI
Spark SQL - Convert String to Date
About Configuration spark.sql.optimizer.metadataOnly
Spark Structured Streaming - Read from and Write into Kafka Topics
Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
Turn off INFO logs in Spark
Change Column Type in PySpark DataFrame
Add Constant Column to PySpark DataFrame
Delete or Remove Columns from PySpark DataFrame
Rename DataFrame Column Names in PySpark
Apache Spark 3.0.0 Installation on Linux Guide
Install Apache Spark 3.0.0 on Windows 10
Load CSV File in PySpark
Improve PySpark Performance using Pandas UDF with Apache Arrow
Read and Write XML files in PySpark
Convert Python Dictionary List to PySpark DataFrame
Pass Environment Variables to Executors in PySpark
Save DataFrame as CSV File in Spark
Write and read parquet files in Scala / Spark
Parquet is a columnar storage format published by Apache. It's commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.
Write and read parquet files in Python / Spark
Parquet is a columnar storage format published by Apache. It's commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.
Convert string to date in Scala / Spark
This code snippet shows how to convert string to date.
Convert string to date in Python / Spark
This code snippet shows how to convert string to date.
Run Multiple Python Scripts PySpark Application with yarn-cluster Mode
Diagnostics: Container is running beyond physical memory limits
Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>
PySpark: Convert Python Array/List to Spark Data Frame
Load Data from Teradata in Spark (PySpark)
Read Hadoop Credential in PySpark
Big Data Tools on Windows via Windows Subsystem for Linux (WSL)
Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux
Install Zeppelin 0.7.3 on Windows 10 using Windows Subsystem for Linux (WSL)
.NET for Apache Spark Preview with Examples
Data Partitioning Functions in Spark (PySpark) Deep Dive
Get the Current Spark Context Settings/Configurations
Read Data from Hive in Spark 1.x and 2.x
Data Partition in Spark (PySpark) In-depth Walkthrough
PySpark - Fix PermissionError: [WinError 5] Access is denied
Spark - Save DataFrame to Hive Table
Connect to SQL Server in Spark (PySpark)
Debug PySpark Code in Visual Studio Code
Implement SCD Type 2 Full Merge via Spark Data Frames
PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame
Load Data into HDFS from SQL Server via Sqoop
Write and Read Parquet Files in HDFS through Spark/Scala
Write and Read Parquet Files in Spark/Scala
Read Text File from Hadoop in Zeppelin through Spark Context
Install Spark 2.2.1 in Windows
Install Zeppelin 0.7.3 on Windows
Diagrams
Spark Application Anatomy
This diagram depicts the relationships among a Spark application, jobs, stages and tasks. One Spark application can contain multiple actions, and each action is related to one Spark job; to run the computation within a job, multiple stages might be involved, as some actions cannot be completed within just one stage; each stage includes many tasks, and the task count is determined by the total number of partitions in the RDD/DataFrame. A task is the lowest parallelism unit in Spark.
Spark SQL Joins - Cross Join (Cartesian Product)
This diagram shows Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Anti Join
This diagram shows the Left Anti Join type in Spark SQL. An anti join returns values from the left relation that have no match with the right. It is also called a left anti join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Semi Join
This diagram shows the Left Semi Join type in Spark SQL. A semi join returns values from the left side of the relation that have a match with the right. It is also called a left semi join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Full Outer Join
This diagram shows Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called full outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Right Outer Join
This diagram shows Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called right outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Outer Join
This diagram shows Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called left outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Inner Join
This diagram shows Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark Partitioning Physical Operators
This diagram shows how Spark decides which repartition physical operator will be used for each scenario.

```
repartition(numPartitions, *cols)
```
Spark Memory Management Overview
This diagram shows an overview of Spark memory management when running in YARN. It helps you to understand how your Spark memory is allocated and how it is used. In a Spark executor, there are two types of memory used: execution memory, which refers to memory used for computation in shuffles, joins, sorts and aggregations; and storage memory, which refers to memory used for caching and propagating internal data across the cluster. When no storage memory is used, execution can use all the available memory and vice versa. These two types of memory usage are controlled by two configuration items: spark.memory.fraction expresses the size of the unified region M as a fraction of (JVM heap space - 300MiB) (default 0.6); the rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors. spark.memory.storageFraction expresses the size of the storage region R as a fraction of M (default 0.5).
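The split described above can be worked through with simple arithmetic; a sketch assuming a hypothetical 4 GiB executor heap and the default fractions:

```python
# Hypothetical executor JVM heap, in MiB.
heap_mib = 4096
reserved_mib = 300            # fixed reserved memory (300 MiB)

spark_memory_fraction = 0.6   # default spark.memory.fraction
storage_fraction = 0.5        # default spark.memory.storageFraction

# M: unified region shared by execution and storage memory.
m_mib = (heap_mib - reserved_mib) * spark_memory_fraction
# R: storage region within M that is immune to eviction.
r_mib = m_mib * storage_fraction

print(f"M = {m_mib:.1f} MiB, R = {r_mib:.1f} MiB")
# → M = 2277.6 MiB, R = 1138.8 MiB
```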
Spark Partition Discovery
Spark supports partition discovery. All built-in file sources (Text/CSV/JSON/ORC/Parquet) support partition discovery and partition information inference. This diagram shows an example data set that is stored with two partition levels: month and country. The following code snippet will read all the underlying parquet files:

```python
df = spark.read.option("basePath", "/data").parquet("/data")
```