Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

Raymond · 2021-09-05

PySpark utilizes Python worker processes to perform transformations, so it's important to configure the Python versions correctly for both the driver and the executors.

Spark configurations

Since version 2.1.0, Spark provides two configuration properties to specify the Python version:

  • spark.pyspark.driver.python: Python binary executable to use for PySpark in the driver only. The default is spark.pyspark.python.
  • spark.pyspark.python: Python binary executable to use for PySpark in both the driver and the executors.

In most cases, your Spark cluster administrators should have set up these properties correctly and you don't need to worry about them. For example, the following is the configuration (spark-defaults.conf) of my local Spark cluster on Windows 10, which uses Python 2.7 for both the driver and the executors:

spark.pyspark.python "D:\\Python2.7\\python.exe"
spark.pyspark.driver.python "D:\\Python2.7\\python.exe"
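
If you don't manage the cluster configuration files, you can also pass these properties per application via spark-submit's --conf flag. A minimal sketch, assuming your script is named app.py and Python is installed at /usr/bin/python3 (both are placeholders):

spark-submit \
  --conf spark.pyspark.python=/usr/bin/python3 \
  --conf spark.pyspark.driver.python=/usr/bin/python3 \
  app.py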

Environment variables

You can also use environment variables if the above properties are not specified in the configuration files:

  • PYSPARK_PYTHON: Python binary executable to use for PySpark in both the driver and the workers. The default is python3 if available, otherwise python. Property spark.pyspark.python takes precedence if it is set.
  • PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in the driver only. The default is PYSPARK_PYTHON. Property spark.pyspark.driver.python takes precedence if it is set.

On a Windows standalone local cluster, you can set these directly as system environment variables. On Linux machines, you can specify them in ~/.bashrc, as shown in the examples below.
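
For example, on Windows you can set the variables for the current Command Prompt session before launching PySpark. A minimal sketch, reusing the Python 2.7 path from the earlier spark-defaults.conf example:

set PYSPARK_PYTHON=D:\Python2.7\python.exe
set PYSPARK_DRIVER_PYTHON=D:\Python2.7\python.exe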

The following is a Linux example for ~/.bashrc:

export PYSPARK_PYTHON=/path/to/your/python/executable
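
You can also point the driver at a different front end than the workers, which is common for interactive use. A minimal sketch, assuming Jupyter is installed; PYSPARK_DRIVER_PYTHON_OPTS is the companion variable for driver options, and this pattern applies when launching the interactive pyspark shell:

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"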
Warning: If the PySpark Python driver and executor properties are already set, the environment variables won't take effect.

Fix inconsistent driver and executor Python versions

If the driver and executor have different Python versions, you may encounter errors like the following:

Exception: Python in worker has different version 2.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
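
To diagnose this, you can compare the interpreter versions the driver and the workers actually use. A minimal sketch (the app name version-check is arbitrary):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

# Python version in the driver process
print("Driver: ", sys.version_info[:3])

# Python version inside a worker process: run a trivial one-partition
# job that reports sys.version_info from the executor side
worker = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: __import__("sys").version_info[:3])
    .collect()[0]
)
print("Workers:", worker)

spark.stop()

If the two printed versions differ in major or minor number, point PYSPARK_PYTHON (or spark.pyspark.python) at a matching interpreter as described above.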

Refer to the following page to find out more: Resolve: Python in worker has different version 2.7 than that in driver 3.8...
