Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
PySpark runs transformations in Python worker processes on the executors, so it is important that the driver and workers use compatible Python versions.
Spark configurations
Since version 2.1.0, Spark provides two configuration properties for specifying the Python version:
- spark.pyspark.driver.python: Python binary executable to use for PySpark in the driver. Defaults to the value of spark.pyspark.python.
- spark.pyspark.python: Python binary executable to use for PySpark in both the driver and the executors.
In most cases, your Spark cluster administrators will have set up these properties correctly and you don't need to worry about them. For example, the following is the spark-defaults.conf of my local Spark cluster on Windows 10, using Python 2.7 for both the driver and the executors:
spark.pyspark.python "D:\\Python2.7\\python.exe"
spark.pyspark.driver.python "D:\\Python2.7\\python.exe"
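If you cannot edit spark-defaults.conf, the same properties can also be passed per application on the command line. A minimal sketch, where the interpreter path and the script name my_app.py are placeholders to replace with your own:

spark-submit \
  --conf spark.pyspark.python=/path/to/your/python/executable \
  --conf spark.pyspark.driver.python=/path/to/your/python/executable \
  my_app.py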
Environment variables
If the above properties are not specified in configuration files, the following environment variables can be used instead:
- PYSPARK_PYTHON: Python binary executable to use for PySpark in both the driver and workers. The default is python3 if available, otherwise python. Property spark.pyspark.python takes precedence if it is set.
- PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in the driver only. The default is the value of PYSPARK_PYTHON. Property spark.pyspark.driver.python takes precedence if it is set.
On a Windows standalone local cluster, you can set these as system environment variables directly. On Linux machines, you can specify them in ~/.bashrc.
The following is one example:
export PYSPARK_PYTHON=/path/to/your/python/executable
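Another option is to set the variables from inside the PySpark application itself, before the SparkSession is created. The following is only a minimal sketch; the interpreter path is a placeholder you need to replace:

import os
from pyspark.sql import SparkSession

# Must be set before the SparkSession is created so that Spark
# launches the driver-side and worker-side Python processes with it.
os.environ["PYSPARK_PYTHON"] = "/path/to/your/python/executable"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/your/python/executable"

spark = SparkSession.builder.appName("python-version-demo").getOrCreate()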
Fixing inconsistent driver and executor Python versions
If the driver and executors have different Python versions, you may encounter errors like the following:
Exception: Python in worker has different version 2.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Refer to the following page to find out more: Resolve: Python in worker has different version 2.7 than that in driver 3.8...
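To confirm which Python versions are actually in use, you can print the driver's version and ask an executor for its own. A minimal sketch, assuming an existing Spark installation and using only placeholder application names:

import platform
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()

def worker_python_version(_):
    # Runs inside a Python worker process on an executor
    import platform
    return platform.python_version()

# Version used by the driver process
print("driver  :", platform.python_version())
# Version used by a worker process on an executor
print("executor:", spark.sparkContext.parallelize([0], numSlices=1).map(worker_python_version).first())

If the two printed versions differ in their minor version, adjust the configuration properties or environment variables described above so they point to the same interpreter.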