Spark Application Anatomy

2022-08-23 spark

This diagram depicts the relationships among a Spark application, its jobs, stages, and tasks.

One Spark application can contain multiple actions, and each action corresponds to one Spark job. To run the computation within a job, multiple stages may be needed, because some jobs cannot be completed within a single stage (a shuffle forces a stage boundary). Each stage consists of many tasks, and the task count is determined by the number of partitions in the RDD/DataFrame. A task is the smallest unit of parallelism in Spark.

Spark Application (PySpark script, Spark Scala application, Spark session initiated via spark-shell or spark-sql, etc.)
├── Spark Job
│   ├── Stage 1
│   │   ├── Task 1
│   │   ├── Task 2
│   │   ├── Task 3
│   │   └── ...
│   ├── Stage 2
│   ├── Stage 3
│   └── ...
├── Spark Job
├── Spark Job
└── ...