Resolve: Python in worker has different version 2.7 than that in driver 3.8...

This article describes how to resolve the Spark exception caused by inconsistent Python versions between the Spark driver and the executors. The issue can occur when you run Spark with a local master on Python 3.8 while interacting with a Hadoop cluster (including Hive) that uses Python 2.7.

Issue context

The Spark application throws the following error:

Exception: Python in worker has different version 2.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

To replicate the error, I can simply change the following configuration items in the Spark default configuration file (spark-defaults.conf):

spark.pyspark.python "D:\\Python2.7\\python.exe"
#spark.pyspark.driver.python "D:\\Python2.7\\python.exe"
spark.pyspark.driver.python "D:\\Python\\python.exe"

The above configuration uses Python 2.7 (D:\Python2.7\python.exe) for the executors and Python 3.8 (D:\Python\python.exe) for the Spark driver.
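To confirm which versions the two configured interpreters actually report, a small check like the one below may help. It is only a sketch: the helper names minor_version and versions_match are made up for illustration, and it assumes both interpreter paths are reachable from the machine where the script runs.

```python
import subprocess
import sys


def minor_version(interpreter):
    """Return 'major.minor' as reported by the given Python executable."""
    # This one-liner is valid in both Python 2.7 and Python 3.
    code = "import sys; print('%d.%d' % sys.version_info[:2])"
    result = subprocess.run(
        [interpreter, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


def versions_match(driver_python, worker_python):
    """True when both interpreters share the same major.minor version."""
    return minor_version(driver_python) == minor_version(worker_python)


if __name__ == "__main__":
    # Replace sys.executable with the paths from spark-defaults.conf,
    # e.g. the values of spark.pyspark.driver.python and spark.pyspark.python.
    print(versions_match(sys.executable, sys.executable))
```

If versions_match returns False for the two configured paths, PySpark would be expected to raise the exception shown above.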

Resolutions

The error message already provides a hint for resolving this issue.

For my environment, I need to change the Spark default configuration so that both settings are consistent. For example, the following configuration uses Python 2.7 for both the driver and the executors:

spark.pyspark.python "D:\\Python2.7\\python.exe"
spark.pyspark.driver.python "D:\\Python2.7\\python.exe"

Alternatively, you can set the two environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON so that they point to the same Python version.
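When the driver is itself a Python script running with a local master, one way to apply the environment-variable approach is to set both variables before the Spark session is created. The following is a minimal sketch under that assumption; pin_pyspark_python is a hypothetical helper name, and on a real cluster the chosen path must also exist on the worker nodes.

```python
import os
import sys


def pin_pyspark_python(interpreter=sys.executable):
    """Point the PySpark driver and workers at the same interpreter."""
    os.environ["PYSPARK_PYTHON"] = interpreter
    os.environ["PYSPARK_DRIVER_PYTHON"] = interpreter
    return interpreter


# Call this before creating any SparkSession or SparkContext,
# so that the variables are in place when PySpark starts its workers.
pin_pyspark_python()
```

The default of sys.executable reuses the interpreter that is running the driver script, which keeps both sides on the same version in local mode.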

In some environments, for example Cloudera Data Science Workbench (CDSW), you don't have permission to change Spark configurations. In that case, you need to ensure your driver and executors use a consistent Python version. There are usually two approaches to address this problem:

If the driver application runs by default in CDSW (or another container), i.e. the Spark master is local, ensure CDSW or the container image has the same default Python version as your Hadoop cluster.
If you cannot change the Python version in the local environment, use the Spark CLI (spark-submit) to submit the application in yarn-cluster mode (expressed as --master yarn --deploy-mode cluster in newer Spark versions). This way, both the driver and the worker containers are created in the cluster with the same Python environment.
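The second approach could be scripted roughly as follows. This is a sketch rather than a definitive recipe: build_submit_command is a hypothetical helper, app.py and the interpreter path are placeholders, and it assumes spark-submit is on the PATH of the machine that submits the job.

```python
def build_submit_command(app_path, python_path=None):
    """Build a spark-submit command that runs the driver inside the YARN cluster."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if python_path:
        # Optional: pin the interpreter explicitly (placeholder path supplied by caller).
        cmd += ["--conf", "spark.pyspark.python=" + python_path]
    cmd.append(app_path)
    return cmd


if __name__ == "__main__":
    # app.py is a placeholder for your PySpark application.
    print(" ".join(build_submit_command("app.py")))
    # To actually submit: subprocess.run(build_submit_command("app.py"), check=True)
```

Because the driver then runs in a cluster container instead of the local environment, the local Python version no longer takes part in the version check.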
