spark
127 items tagged with "spark"
Articles
Get Started with Apache Kylin - OLAP for Big Data
SCD Type 2 - Implement FULL Merge with Delta Lake Table via PySpark
PySpark DataFrame - Add or Subtract Milliseconds from Timestamp Column
This code snippet shows you how to add or subtract milliseconds (or microseconds) and seconds from a timestamp column in a Spark DataFrame. It first creates a DataFrame in memory and then adds and subtracts milliseconds/seconds from the timestamp column ts using Spark SQL interval expressions. Output:

```
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|id |ts                        |ts1                       |ts2                       |ts3                       |
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|1  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|2  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|3  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|4  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
+---+--------------------------+--------------------------+--------------------------+--------------------------+
```

*Note: the code assumes a SparkSession object already exists via the variable name `spark`.*
java.lang.NoSuchMethodError: PoolConfig.setMinEvictableIdleTime
Delta Lake with PySpark Walkthrough
PySpark partitionBy with Examples
Spark Bucketing and Bucket Pruning Explained
Spark Basics - Application, Driver, Executor, Job, Stage and Task Walkthrough
Spark cache() and persist() Differences
Use Spark SQL Partitioning Hints
Start Spark History Server UI
This code snippet provides a simple CLI to start the Spark History Server service. About Spark History Server: the Spark History Server can be used to look up historical Spark jobs that completed successfully or failed. By default, Spark execution logs are saved into local temporary folders. You can add configuration items to spark-defaults.conf to save logs to HDFS. For example, the following configurations ensure the logs are stored in my local Hadoop environment:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://localhost:9000/shared/spark-logs
spark.history.fs.logDirectory    hdfs://localhost:9000/shared/spark-logs
```

In the code snippet, `SPARK_HOME` is the environment variable that points to the location where Spark is installed. If this variable is not defined, you can directly specify the full path to the shell script (sbin/start-history-server.sh). History Server URL: by default, the URL is http://localhost:18080/ in a local environment. You can replace localhost with the address of the server where the history server is started; it is usually hosted on an edge server. By clicking the link of each App, you will be able to find the job details for each Spark application.
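The CLI referenced above is the script shipped in Spark's sbin directory; a minimal sketch, assuming Spark is installed and `SPARK_HOME` is set:

```shell
# Start the Spark History Server (it reads spark.history.* settings
# from $SPARK_HOME/conf/spark-defaults.conf).
$SPARK_HOME/sbin/start-history-server.sh

# Stop it again when finished.
$SPARK_HOME/sbin/stop-history-server.sh
```

Once started, browse to http://localhost:18080/ (or the host where the server runs) to see the application list.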
Spark Join Strategy Hints for SQL Queries
Spark spark.sql.files.maxPartitionBytes Explained in Detail
Differences between spark.sql.shuffle.partitions and spark.default.parallelism
Spark Insert Data into Hive Tables
Spark DEBUG: It is possible the underlying files have been updated.
Spark Dynamic and Static Partition Overwrite
Install Spark 3.3.0 on Linux or WSL
Spark 2.x to 3.x - Date, Timestamp and Int96 Rebase Modes
PySpark - Read and Write Orc Files
PySpark - Read and Write Avro Files
Spark Hash Functions Introduction - MD5 and SHA
Install Spark 3.2.1 on Linux or WSL
Spark SQL - Literals (Constants)
Spark SQL Joins with Examples
AWS EMR Debug - Container release on a *lost* node
Spark submit --num-executors --executor-cores --executor-memory
Spark repartition Function Internals
Create Spark Indexes via Hyperspace
Read Parquet Files from Nested Directories
Spark Read JSON Lines (.jsonl) File
Spark - Read and Write Data with MongoDB
Spark - Save DataFrame as Hive Table
Spark (PySpark) - Read Data from SQL Server Database
PySpark: Convert JSON String Column in DataFrame to Array
PySpark - Convert Python Array or List to Spark DataFrame
Spark Dataset and DataFrame
"Delete" Rows (Data) from PySpark DataFrame
Spark - Check if Array Column Contains Specific Value
PySpark: Read File in Google Cloud Storage
Spark - Read from BigQuery Table
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set
Killing Running Applications of Spark
Spark DataFrame: Show Full Column Content without Truncation
Save Spark DataFrame to Teradata and Resolve Common Errors
Spark repartition vs. coalesce
Add JARs to a Spark Job
Connect to PostgreSQL in Spark (PySpark)
Spark 3.0.1: Connect to HBase 2.4.1
Spark Scala: Load Data from MySQL
Connect to MySQL in Spark (PySpark)
Apache Spark 3.0.1 Installation on macOS
Show Headings (Column Names) in spark-sql CLI Result
Apache Spark 3.0.1 Installation on Linux or WSL Guide
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Get Started on .NET 5 with Apache Spark
.NET for Apache Spark 1.0 was officially released on 14th Oct 2020. This version was released together with .NET Core 3.0. Since .NET for Apache Spark is written against .NET Standard, it should work with .NET 5 too. This article shows how to use .NET 5 with Apache Spark.
Spark Scala: Load Data from Teradata
Spark Scala: Load Data from SQL Server
Spark Scala: Read XML File as DataFrame
Scala: Read JSON file as Spark DataFrame
Scala: Read CSV File as Spark DataFrame
Scala: Parse JSON String as Spark DataFrame
Scala: Change Column Type in Spark Data Frame
Scala: Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Scala. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
Scala - Add Constant Column to Spark Data Frame
Scala: Remove Columns from Spark Data Frame
Scala: Change Data Frame Column Names in Spark
Scala: Convert List to Spark Data Frame
Fix - ERROR SparkUI: Failed to bind SparkUI
Spark SQL - Convert String to Date
About Configuration spark.sql.optimizer.metadataOnly
Spark Structured Streaming - Read from and Write into Kafka Topics
Filter Spark DataFrame Columns with None or Null Values
This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
Turn off INFO logs in Spark
Change Column Type in PySpark DataFrame
Add Constant Column to PySpark DataFrame
Delete or Remove Columns from PySpark DataFrame
Rename DataFrame Column Names in PySpark
Apache Spark 3.0.0 Installation on Linux Guide
Install Apache Spark 3.0.0 on Windows 10
Load CSV File in PySpark
Improve PySpark Performance using Pandas UDF with Apache Arrow
Read and Write XML files in PySpark
Convert Python Dictionary List to PySpark DataFrame
Pass Environment Variables to Executors in PySpark
Save DataFrame as CSV File in Spark
Write and read parquet files in Scala / Spark
Parquet is a columnar storage format published by Apache. It's commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.
Write and read parquet files in Python / Spark
Parquet is a columnar storage format published by Apache. It's commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading Parquet files.
Convert string to date in Scala / Spark
This code snippet shows how to convert string to date.
Convert string to date in Python / Spark
This code snippet shows how to convert string to date.
Run Multiple Python Scripts PySpark Application with yarn-cluster Mode
Diagnostics: Container is running beyond physical memory limits
Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>
PySpark: Convert Python Array/List to Spark Data Frame
Load Data from Teradata in Spark (PySpark)
Read Hadoop Credential in PySpark
Big Data Tools on Windows via Windows Subsystem for Linux (WSL)
Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux
Install Zeppelin 0.7.3 on Windows 10 using Windows Subsystem for Linux (WSL)
.NET for Apache Spark Preview with Examples
Data Partitioning Functions in Spark (PySpark) Deep Dive
Get the Current Spark Context Settings/Configurations
Read Data from Hive in Spark 1.x and 2.x
Data Partition in Spark (PySpark) In-depth Walkthrough
PySpark - Fix PermissionError: [WinError 5] Access is denied
Spark - Save DataFrame to Hive Table
Connect to SQL Server in Spark (PySpark)
Debug PySpark Code in Visual Studio Code
Implement SCD Type 2 Full Merge via Spark Data Frames
PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame
Load Data into HDFS from SQL Server via Sqoop
Write and Read Parquet Files in HDFS through Spark/Scala
Write and Read Parquet Files in Spark/Scala
Read Text File from Hadoop in Zeppelin through Spark Context
Install Spark 2.2.1 in Windows
Install Zeppelin 0.7.3 on Windows
Diagrams
Spark Application Anatomy
This diagram depicts the relationships among a Spark application, jobs, stages and tasks. One Spark application can contain multiple actions, and each action is related to one Spark job; to run the computation within a job, multiple stages might be involved, as some actions cannot be completed within just one stage; each stage includes many tasks, and the task count is determined by the total number of partitions in the RDD/DataFrame. A task is the lowest parallelism unit in Spark.
Spark SQL Joins - Cross Join (Cartesian Product)
This diagram shows Cross Join type in Spark SQL. It returns the Cartesian product of two tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Anti Join
This diagram shows the Left Anti Join type in Spark SQL. An anti join returns values from the left relation that have no match with the right. It is also called a left anti join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Semi Join
This diagram shows the Left Semi Join type in Spark SQL. A semi join returns values from the left side of the relation that have a match with the right. It is also called a left semi join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Full Outer Join
This diagram shows Full Join type in Spark SQL. It returns all values from both relations, appending NULL values on the side that does not have a match. It is also called full outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Right Outer Join
This diagram shows Right Join type in Spark SQL. It returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match. It is also called right outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Left Outer Join
This diagram shows Left Join type in Spark SQL. It returns all values from the left relation and the matched values from the right relation, or appends NULL if there is no match. It is also called left outer join. References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark SQL Joins - Inner Join
This diagram shows Inner Join type in Spark SQL. It returns rows that have matching values in both tables (relations). References JOIN - Spark 3.2.1 Documentation (apache.org)
Spark Partitioning Physical Operators
This diagram shows how Spark decides which repartition physical operator will be used for each scenario.

```
repartition(numPartitions, *cols)
```
Spark Memory Management Overview
This diagram shows an overview of Spark memory management when running in YARN. It helps you to understand how your Spark memory is allocated and how it is used. In a Spark executor, there are two types of memory used: execution memory, which refers to memory used for computation in shuffles, joins, sorts and aggregations; and storage memory, which refers to memory used for caching and propagating internal data across the cluster. When no storage memory is used, execution can use all the available memory and vice versa. These two types of memory usage are controlled by two configuration items: spark.memory.fraction expresses the size of the unified region M as a fraction of (JVM heap space - 300MiB) (default 0.6); the rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors. spark.memory.storageFraction expresses the size of the storage region R as a fraction of M (default 0.5).
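The split described above can be worked through with simple arithmetic; a sketch assuming a hypothetical 4 GiB executor heap and the default fractions:

```python
# Hypothetical executor JVM heap, in MiB.
heap_mib = 4096
reserved_mib = 300            # fixed reserved memory (300 MiB)

spark_memory_fraction = 0.6   # default spark.memory.fraction
storage_fraction = 0.5        # default spark.memory.storageFraction

# M: unified region shared by execution and storage memory.
m_mib = (heap_mib - reserved_mib) * spark_memory_fraction
# R: storage region within M that is immune to eviction.
r_mib = m_mib * storage_fraction

print(f"M = {m_mib:.1f} MiB, R = {r_mib:.1f} MiB")
# → M = 2277.6 MiB, R = 1138.8 MiB
```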
Spark Partition Discovery
Spark supports partition discovery. All built-in file sources (Text/CSV/JSON/ORC/Parquet) support partition discovery and partition information inference. This diagram shows an example data set that is stored with two partition levels: month and country. The following code snippet will read all the underlying parquet files:

```python
df = spark.read.option("basePath", "/data").parquet("/data")
```