spark

127 items tagged with "spark"

116 Articles
11 Diagrams

Articles

Get Started with Apache Kylin - OLAP for Big Data

2023-09-14
The Data Engineering

SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark

2022-09-01
Spark & PySpark

PySpark DataFrame - Add or Subtract Milliseconds from Timestamp Column

This code snippet shows you how to add or subtract milliseconds (or microseconds) and seconds from a timestamp column in a Spark DataFrame. It first creates a DataFrame in memory and then adds and subtracts milliseconds/seconds from the timestamp column ts using Spark SQL internals. Output:

```
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|id |ts                        |ts1                       |ts2                       |ts3                       |
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|1  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|2  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|3  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|4  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
+---+--------------------------+--------------------------+--------------------------+--------------------------+
```

Note: the code assumes a SparkSession object already exists via the variable name `spark`.

2022-09-01
Code Snippets & Tips

java.lang.NoSuchMethodError: PoolConfig.setMinEvictableIdleTime

2022-08-27
Spark & PySpark

Delta Lake with PySpark Walkthrough

2022-08-26
Spark & PySpark

PySpark partitionBy with Examples

2022-08-25
Spark & PySpark

Spark Bucketing and Bucket Pruning Explained

2022-08-24
Spark & PySpark

Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough

2022-08-23
Spark & PySpark

Spark cache() and persist() Differences

2022-08-21
Spark & PySpark

Use Spark SQL Partitioning Hints

2022-08-21
Spark & PySpark

Start Spark History Server UI

This code snippet provides the simple CLI command to start the Spark History Server service. About Spark History Server: Spark History Server can be used to look up historical Spark jobs that completed successfully or failed. By default, Spark execution logs are saved into local temporary folders. You can add configuration items into spark-defaults.conf to save logs to HDFS. For example, the following configurations ensure the logs are stored in my local Hadoop environment:

```
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/shared/spark-logs
spark.history.fs.logDirectory hdfs://localhost:9000/shared/spark-logs
```

In the code snippet, `SPARK_HOME` is the environment variable that points to the location where Spark is installed. If this variable is not defined, you can directly specify the full path to the shell script (sbin/start-history-server.sh). History Server URL: by default, the URL is http://localhost:18080/ in a local environment. You can replace localhost with the address of the server where the history server is started; usually it is an edge server. By clicking the link of each App, you will be able to find the job details for each Spark application.

2022-08-21
Code Snippets & Tips

Spark Join Strategy Hints for SQL Queries

2022-08-21
Spark & PySpark

Spark spark.sql.files.maxPartitionBytes Explained in Detail

2022-08-21
Spark & PySpark

Differences between spark.sql.shuffle.partitions and spark.default.parallelism

2022-08-20
Spark & PySpark

Spark Insert Data into Hive Tables

2022-08-17
Spark & PySpark

Spark DEBUG: It is possible the underlying files have been updated.

2022-06-22
Spark & PySpark

Spark Dynamic and Static Partition Overwrite

2022-06-22
Spark & PySpark

Install Spark 3.3.0 on Linux or WSL

2022-06-20
Tools & Systems

Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes

2022-06-19
Spark & PySpark

PySpark - Read and Write Orc Files

2022-06-18
Spark & PySpark

PySpark - Read and Write Avro Files

2022-06-18
Spark & PySpark

Spark Hash Functions Introduction - MD5 and SHA

2022-06-16
Spark & PySpark

Install Spark 3.2.1 on Linux or WSL

2022-06-14
Spark & PySpark

Spark SQL - Literals (Constants)

2022-05-31
Spark & PySpark

Spark SQL Joins with Examples

2022-05-31
Spark & PySpark

AWS EMR Debug - Container release on a *lost* node

2022-04-29
Cloud Computing

Spark submit --num-executors --executor-cores --executor-memory

2022-03-29
Spark & PySpark

Spark repartition Function Internals

2022-03-28
Spark & PySpark

Create Spark Indexes via Hyperspace

2021-12-22
Spark & PySpark

Read Parquet Files from Nested Directories

2021-12-22
Spark & PySpark

Spark Read JSON Lines (.jsonl) File

2021-12-21
Spark & PySpark

Spark - Read and Write Data with MongoDB

2021-10-15
Spark & PySpark

Spark - Save DataFrame as a Hive Table

2021-10-13
Spark (Chinese)

Spark (PySpark) - Read Data from a SQL Server Database

2021-10-13
Spark (Chinese)

PySpark: Convert a JSON String Column in a DataFrame to an Array

2021-10-13
Spark (Chinese)

PySpark - Convert a Python Array or List to a Spark DataFrame

2021-10-13
Spark (Chinese)

Spark Dataset and DataFrame

2021-10-13
Spark & PySpark

"Delete" Rows (Data) from PySpark DataFrame

2021-09-25
Spark & PySpark

Spark - Check if Array Column Contains Specific Value

2021-05-22
Code Snippets & Tips

PySpark: Read File in Google Cloud Storage

2021-03-21
Spark & PySpark

Spark - Read from BigQuery Table

2021-03-21
Google Cloud Platform

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set

2021-03-08
Spark & PySpark

Killing Running Applications of Spark

2021-03-08
Spark & PySpark

Spark DataFrame: Show Full Column Content without Truncation

2021-03-08
Code Snippets & Tips

Save Spark DataFrame to Teradata and Resolve Common Errors

2021-03-08
Teradata

Spark repartition vs. coalesce

2021-03-07
Spark & PySpark

Add JARs to a Spark Job

2021-02-20
Spark & PySpark

Connect to PostgreSQL in Spark (PySpark)

2021-02-14
Spark & PySpark

Spark 3.0.1: Connect to HBase 2.4.1

2021-02-05
Spark & PySpark

Spark Scala: Load Data from MySQL

2021-01-24
Spark & PySpark

Connect to MySQL in Spark (PySpark)

2021-01-23
Spark & PySpark

Apache Spark 3.0.1 Installation on macOS

2021-01-17
Spark & PySpark

Show Headings (Column Names) in spark-sql CLI Result

2020-12-28
Spark & PySpark

Apache Spark 3.0.1 Installation on Linux or WSL Guide

2020-12-27
Spark & PySpark

Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver

2020-12-27
Spark & PySpark

Get Started on .NET 5 with Apache Spark

.NET for Apache Spark 1.0 was officially released on 14 Oct 2020. This version was released together with .NET Core 3.0. Since .NET for Apache Spark is written against .NET Standard, it should work with .NET 5 too. This article shows how to use .NET 5 with Apache Spark.

2020-12-23
.NET Programming

Spark Scala: Load Data from Teradata

2020-12-19
Spark & PySpark

Spark Scala: Load Data from SQL Server

2020-12-18
Spark & PySpark

Spark Scala: Read XML File as DataFrame

2020-12-16
Spark & PySpark

Scala: Read JSON file as Spark DataFrame

2020-12-16
Code Snippets & Tips

Scala: Read CSV File as Spark DataFrame

2020-12-16
Spark & PySpark

Scala: Parse JSON String as Spark DataFrame

2020-12-16
Spark & PySpark

Scala: Change Column Type in Spark Data Frame

2020-12-14
Spark & PySpark

Scala: Filter Spark DataFrame Columns with None or Null Values

This article shows you how to filter NULL/None values from a Spark DataFrame using Scala. Either DataFrame.filter or DataFrame.where can be used to filter out null values.

2020-12-14
Spark & PySpark

Scala - Add Constant Column to Spark Data Frame

2020-12-14
Spark & PySpark

Scala: Remove Columns from Spark Data Frame

2020-12-13
Spark & PySpark

Scala: Change Data Frame Column Names in Spark

2020-12-13
Spark & PySpark

Scala: Convert List to Spark Data Frame

2020-12-13
Spark & PySpark

Fix - ERROR SparkUI: Failed to bind SparkUI

2020-12-13
Spark & PySpark

Spark SQL - Convert String to Date

2020-10-23
Spark & PySpark

About Configuration spark.sql.optimizer.metadataOnly

2020-10-03
Spark & PySpark

Spark Structured Streaming - Read from and Write into Kafka Topics

2020-09-06
Streaming Analytics & Kafka

Filter Spark DataFrame Columns with None or Null Values

This article shows you how to filter NULL/None values from a Spark DataFrame using Python. Either DataFrame.filter or DataFrame.where can be used to filter out null values.

2020-08-10
Spark & PySpark

Turn off INFO logs in Spark

2020-08-09
Spark & PySpark

Change Column Type in PySpark DataFrame

2020-08-09
Spark & PySpark

Add Constant Column to PySpark DataFrame

2020-08-09
Spark & PySpark

Delete or Remove Columns from PySpark DataFrame

2020-08-09
Spark & PySpark

Rename DataFrame Column Names in PySpark

2020-08-09
Spark & PySpark

Apache Spark 3.0.0 Installation on Linux Guide

2020-08-09
Spark & PySpark

Install Apache Spark 3.0.0 on Windows 10

2020-08-09
Spark & PySpark

Load CSV File in PySpark

2020-08-04
Spark & PySpark

Improve PySpark Performance using Pandas UDF with Apache Arrow

2019-12-29
Spark & PySpark

Read and Write XML files in PySpark

2019-12-26
Code Snippets & Tips

Convert Python Dictionary List to PySpark DataFrame

2019-12-25
Spark & PySpark

Pass Environment Variables to Executors in PySpark

2019-12-03
Code Snippets & Tips

Save DataFrame as CSV File in Spark

2019-12-03
Spark & PySpark

Write and read parquet files in Scala / Spark

Parquet is a columnar storage format published by Apache. It is commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.

2019-11-18
Code Snippets & Tips

Write and read parquet files in Python / Spark

Parquet is a columnar storage format published by Apache. It is commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.

2019-11-18
Code Snippets & Tips

Convert string to date in Scala / Spark

This code snippet shows how to convert a string to a date.

2019-11-18
Code Snippets & Tips

Convert string to date in Python / Spark

This code snippet shows how to convert a string to a date.

2019-11-18
Code Snippets & Tips

Run Multiple Python Scripts PySpark Application with yarn-cluster Mode

2019-08-25
Spark & PySpark

Diagnostics: Container is running beyond physical memory limits

2019-07-17
Spark & PySpark

Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>

2019-07-10
Spark & PySpark

PySpark: Convert Python Array/List to Spark Data Frame

2019-07-10
Spark & PySpark

Load Data from Teradata in Spark (PySpark)

2019-07-06
Spark & PySpark

Read Hadoop Credential in PySpark

2019-07-06
Spark & PySpark

Big Data Tools on Windows via Windows Subsystem for Linux (WSL)

2019-05-19
Sqoop

Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux

2019-05-19
Spark & PySpark

Install Zeppelin 0.7.3 on Windows 10 using Windows Subsystem for Linux (WSL)

2019-05-18
Zeppelin

.NET for Apache Spark Preview with Examples

2019-04-26
Spark & PySpark

Data Partitioning Functions in Spark (PySpark) Deep Dive

2019-04-06
Spark & PySpark

Get the Current Spark Context Settings/Configurations

2019-04-05
Spark & PySpark

Read Data from Hive in Spark 1.x and 2.x

2019-04-04
Spark & PySpark

Data Partition in Spark (PySpark) In-depth Walkthrough

2019-03-30
Spark & PySpark

PySpark - Fix PermissionError: [WinError 5] Access is denied

2019-03-27
Spark & PySpark

Spark - Save DataFrame to Hive Table

2019-03-27
Spark & PySpark

Connect to SQL Server in Spark (PySpark)

2019-03-23
Spark & PySpark

Debug PySpark Code in Visual Studio Code

2019-03-03
Spark & PySpark

Implement SCD Type 2 Full Merge via Spark Data Frames

2019-02-03
Spark & PySpark

PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame

2019-01-05
Spark & PySpark

Load Data into HDFS from SQL Server via Sqoop

2018-04-23
Sqoop

Write and Read Parquet Files in HDFS through Spark/Scala

2018-03-17
Spark & PySpark

Write and Read Parquet Files in Spark/Scala

2018-03-17
Spark & PySpark

Read Text File from Hadoop in Zeppelin through Spark Context

2018-03-03
Spark & PySpark

Install Spark 2.2.1 in Windows

2018-02-25
Spark & PySpark

Install Zeppelin 0.7.3 on Windows

2018-02-07
Zeppelin

Diagrams

Spark Application Anatomy

This diagram depicts the relationships among a Spark application, jobs, stages and tasks. One Spark application can contain multiple actions, and each action corresponds to one Spark job; to run the computation within a job, multiple stages might be involved, as some actions cannot be completed within just one stage; each stage includes many tasks, and the task count is decided by the total partitions of the RDD/DataFrame. A task is the lowest parallelism unit in Spark.

2022-08-23
Solution Diagrams

Spark SQL Joins - Cross Join (Cartesian Product)

This diagram shows the Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Anti Join

This diagram shows the Anti Join type in Spark SQL. An anti join returns values from the left relation that have no match with the right. It is also called a left anti join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Semi Join

This diagram shows the Semi Join type in Spark SQL. A semi join returns values from the left side of the relation that have a match with the right. It is also called a left semi join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Full Outer Join

This diagram shows the Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called a full outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Right Outer Join

This diagram shows the Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called a right outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Left Outer Join

This diagram shows the Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called a left outer join. References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark SQL Joins - Inner Join

This diagram shows the Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References: JOIN - Spark 3.2.1 Documentation (apache.org)

2022-05-31
Kontext's Project

Spark Partitioning Physical Operators

This diagram shows how Spark decides which repartition physical operator will be used for each scenario of `repartition(numPartitions, *cols)`.

2022-03-29
Solution Diagrams

Spark Memory Management Overview

This diagram shows an overview of Spark memory management when running in YARN. It helps you understand how your Spark memory is allocated and used. In a Spark executor, two types of memory are used: execution memory, which refers to memory used for computation in shuffles, joins, sorts and aggregations; and storage memory, which refers to memory used for caching and propagating internal data across the cluster. When no storage memory is used, execution can use all the available memory, and vice versa. These two regions are decided by two configuration items: spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MiB) (default 0.6); the rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors. spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5).

2022-03-27
Solution Diagrams
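As a worked example of the two fractions described above (the 4 GiB heap size is illustrative; the 300 MiB reservation and the 0.6/0.5 defaults come from Spark's configuration):

```python
# Worked example for a 4 GiB executor heap (illustrative size).
heap_mib = 4096
reserved_mib = 300            # fixed reserved memory
memory_fraction = 0.6         # spark.memory.fraction (default)
storage_fraction = 0.5        # spark.memory.storageFraction (default)

usable_mib = heap_mib - reserved_mib
m_mib = usable_mib * memory_fraction   # unified execution + storage region (M)
r_mib = m_mib * storage_fraction       # storage region protected from eviction (R)
user_mib = usable_mib - m_mib          # left for user data structures and metadata
```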

Spark Partition Discovery

Spark supports partition discovery. All built-in file sources (Text/CSV/JSON/ORC/Parquet) support partition discovery and partition information inference. This diagram shows an example data set that is stored with two partition levels: month and country. The following code snippet will read all the underlying parquet files:

```python
df = spark.read.option("basePath", "/data").parquet("/data")
```

2021-12-22
Solution Diagrams