Run a PySpark Application with Multiple Python Scripts in yarn-cluster Mode

When submitting Spark applications to a YARN cluster, two deploy modes can be used: client and cluster. In client mode (the default), the Spark driver runs on the machine from which the application was submitted, while in cluster mode the driver runs inside the YARN ApplicationMaster on a node chosen by the cluster. On this page, I am going to show you how to submit a PySpark application with multiple Python script files in both modes.

PySpark application

The application is very simple and consists of two script files.

pyspark_example.py

from pyspark.sql import SparkSession
from pyspark_example_module import test_function

appName = "Python Example - PySpark Row List to Pandas Data Frame"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .getOrCreate()

# Call the function
test_function()

This script references another script file named pyspark_example_module.py. It creates a Spark session and then calls the function from the other module.
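
Though not required for this small example, it is good practice to stop the session explicitly at the end of the script so the YARN application finishes cleanly; a minimal addition would be:

# Stop the Spark session (optional here; it is also torn down when the Python process exits)
spark.stop()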

pyspark_example_module.py

This is a plain Python script with a single function in it.

def test_function():
    """
    Test function
    """
    print("This is a test function")

Run the application with local master

To run the application with a local master, we can simply call the spark-submit CLI from the script folder.

spark-submit pyspark_example.py
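
If no master is configured elsewhere (for example in spark-defaults.conf), this typically runs with a local master; the explicit equivalent is:

spark-submit --master "local[*]" pyspark_example.py

Because both script files sit in the same folder and the driver runs locally, the import of pyspark_example_module resolves without --py-files.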

Run the application in YARN with deployment mode as client

The deploy mode is specified through the --deploy-mode argument, and --py-files is used to ship the other Python script files this application depends on.

spark-submit --master yarn --deploy-mode client --py-files pyspark_example_module.py  pyspark_example.py
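
--py-files accepts a comma-separated list of .py, .zip or .egg files, so multiple dependency modules can be shipped in one go. For example (the extra module name below is just a placeholder):

spark-submit --master yarn --deploy-mode client --py-files pyspark_example_module.py,other_module.py  pyspark_example.py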

Run the application in YARN with deployment mode as cluster

To run the application in cluster mode, simply change the argument --deploy-mode to cluster.

spark-submit --master yarn --deploy-mode cluster --py-files pyspark_example_module.py  pyspark_example.py

The scripts will complete successfully, as the following log shows:

2019-08-25 12:07:09,047 INFO yarn.Client:
          client token: N/A
          diagnostics: N/A
          ApplicationMaster host: ***
          ApplicationMaster RPC port: 3047
          queue: default
          start time: 1566698770726
          final status: SUCCEEDED
          tracking URL: http://localhost:8088/proxy/application_1566698727165_0001/
          user: tangr

[Screenshot: the application's output shown in the YARN web UI]

As the screenshot above shows, the function's output also appears in YARN, since in cluster mode the driver runs inside the cluster.
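
If log aggregation is enabled on the cluster, the same output can also be retrieved from the command line after the application finishes, using the application ID from the log above:

yarn logs -applicationId application_1566698727165_0001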

Upload the scripts to HDFS so that they can be accessed by all the workers

When submitting the application through a Hue Oozie workflow, you will usually use HDFS file locations.

Use the following command to upload the script files to HDFS:

hadoop fs -copyFromLocal *.py /scripts
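
If the /scripts folder does not exist yet, create it before copying (assuming your user has write permission at the root):

hadoop fs -mkdir -p /scripts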

Both scripts are uploaded to the /scripts folder in HDFS:

-rw-r--r--   1 tangr supergroup        288 2019-08-25 12:11 /scripts/pyspark_example.py
-rw-r--r--   1 tangr supergroup         91 2019-08-25 12:11 /scripts/pyspark_example_module.py

And then run the following command to use the HDFS scripts:

spark-submit --master yarn --deploy-mode cluster --py-files hdfs://localhost:19000/scripts/pyspark_example_module.py  hdfs://localhost:19000/scripts/pyspark_example.py
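
The hdfs://localhost:19000 prefix is the NameNode address of this particular environment. On many setups where fs.defaultFS already points at that NameNode, the scheme-relative form works too:

spark-submit --master yarn --deploy-mode cluster --py-files hdfs:///scripts/pyspark_example_module.py  hdfs:///scripts/pyspark_example.py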

The application should be able to complete successfully without errors.

If you use Hue, follow this page to set up your Spark action: How to Submit Spark jobs with Spark on YARN and Oozie.

Replace the file names accordingly:

  • Jar/py names: pyspark_example.py
  • Files: /scripts/pyspark_example_module.py
  • Options list: --py-files pyspark_example_module.py. If you have multiple files, separate them with commas.
  • In the settings of this action, change master and deploy mode accordingly.

*Image from gethue.com.
