

When submitting Spark applications to a YARN cluster, two deploy modes can be used: client and cluster. In client mode (the default), the Spark driver runs on the machine from which the application was submitted; in cluster mode, the driver runs on a node inside the cluster chosen by YARN. On this page, I am going to show you how to submit a PySpark application with multiple Python script files in both modes.

PySpark application

The application is very simple and consists of two script files.

from pyspark.sql import SparkSession
from pyspark_example_module import test_function

appName = "Python Example - PySpark Row List to Pandas Data Frame"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .getOrCreate()

# Call the function defined in the other script file
test_function()

This script file imports test_function from the module pyspark_example_module, which lives in a second script file. It creates a Spark session and then calls the function from the other module.

The second script file is a plain Python script that contains a single simple function.

def test_function():
    """Test function"""
    print("This is a test function")

Run the application with local master

To run the application with a local master, call the spark-submit CLI from the folder that contains the scripts.


Run the application in YARN with deployment mode as client

The deploy mode is specified through the --deploy-mode argument. The --py-files argument specifies the other Python script files used by the application.

spark-submit --master yarn --deploy-mode client --py-files

Run the application in YARN with deployment mode as cluster

To run the application in cluster mode, simply change the argument --deploy-mode to cluster.

spark-submit --master yarn --deploy-mode cluster --py-files
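
The two commands differ only in the value passed to --deploy-mode. As a minimal sketch, the argument list can be assembled in Python as follows. The file names main_script.py and helper_module.py are placeholders invented for illustration, not the names used in this article, and build_spark_submit is a hypothetical helper rather than part of Spark.

```python
import shlex


def build_spark_submit(main_script, py_files, deploy_mode="client", master="yarn"):
    """Return the spark-submit argument list for a multi-file PySpark app."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode}")
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        # --py-files takes a single comma-separated value
        args += ["--py-files", ",".join(py_files)]
    # The main script comes last, after all options
    args.append(main_script)
    return args


# Placeholder file names (assumptions for illustration only)
for mode in ("client", "cluster"):
    cmd = build_spark_submit("main_script.py", ["helper_module.py"], deploy_mode=mode)
    print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run on a machine where spark-submit is on the PATH; the sketch only prints the commands so that it runs without a Spark installation.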

The application completes successfully, as the following log shows:

2019-08-25 12:07:09,047 INFO yarn.Client:
          client token: N/A
          diagnostics: N/A
          ApplicationMaster host: ***
          ApplicationMaster RPC port: 3047
          queue: default
          start time: 1566698770726
          final status: SUCCEEDED
          tracking URL: http://localhost:8088/proxy/application_1566698727165_0001/
          user: tangr
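
The report above is plain "key: value" text, so a script that wraps spark-submit could check the final status programmatically. The following is a rough sketch under that assumption; parse_yarn_report is a hypothetical helper, and the sample text is copied from the log above.

```python
SAMPLE_REPORT = """\
2019-08-25 12:07:09,047 INFO yarn.Client:
          client token: N/A
          diagnostics: N/A
          ApplicationMaster host: ***
          ApplicationMaster RPC port: 3047
          queue: default
          start time: 1566698770726
          final status: SUCCEEDED
          tracking URL: http://localhost:8088/proxy/application_1566698727165_0001/
          user: tangr
"""


def parse_yarn_report(report):
    """Collect the indented 'key: value' lines of a yarn.Client report."""
    fields = {}
    for line in report.splitlines():
        # Split on the first ': ' only, so URLs in the value stay intact
        key, sep, value = line.strip().partition(": ")
        if sep:  # skip lines without a 'key: value' shape, e.g. the header
            fields[key] = value
    return fields


report = parse_yarn_report(SAMPLE_REPORT)
print(report["final status"], report["tracking URL"])
```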


The output is also visible in YARN, as the screenshot above shows.

Submit scripts to HDFS so that they can be accessed by all the workers

When you submit the application through a Hue Oozie workflow, you usually use HDFS file locations.

Use the following command to upload the script files to HDFS:

hadoop fs -copyFromLocal *.py /scripts

Both scripts are uploaded to the /scripts folder in HDFS:

-rw-r--r--   1 tangr supergroup        288 2019-08-25 12:11 /scripts/
-rw-r--r--   1 tangr supergroup         91 2019-08-25 12:11 /scripts/

And then run the following command to use the HDFS scripts:

spark-submit --master yarn --deploy-mode cluster --py-files hdfs://localhost:19000/scripts/  hdfs://localhost:19000/scripts/
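
Both arguments in the command above are full HDFS URIs built from the NameNode address and the /scripts folder. A small sketch of that mapping is shown below; hdfs_uri is a hypothetical helper and the two file names are placeholders, while localhost:19000 and /scripts are the values used in this article.

```python
import posixpath


def hdfs_uri(namenode, directory, filename):
    """Build an hdfs:// URI for a file stored in an HDFS directory."""
    # Normalise the directory so exactly one slash separates each part
    path = posixpath.join("/", directory.strip("/"), filename)
    return f"hdfs://{namenode}{path}"


# Placeholder file names (assumptions for illustration only)
main_uri = hdfs_uri("localhost:19000", "/scripts", "main_script.py")
helper_uri = hdfs_uri("localhost:19000", "/scripts", "helper_module.py")

print("spark-submit --master yarn --deploy-mode cluster --py-files",
      helper_uri, main_uri)
```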

The application should be able to complete successfully without errors.

If you use Hue, follow this page to set up your Spark action: How to Submit Spark jobs with Spark on YARN and Oozie.

Replace the file names accordingly:

  • Jar/py names:
  • Files: /scripts/
  • Options list: --py-files followed by the dependent script file(s). If you have multiple files, separate them with commas.
  • In the settings of this action, change master and deploy mode accordingly.


Last modified by Raymond 6 months ago