This page summarizes the steps to install the latest version (2.4.3) of Apache Spark on Windows 10 via Windows Subsystem for Linux (WSL).
Follow either of these pages to install WSL on a system or non-system drive in Windows 10:
- Install Windows Subsystem for Linux on a Non-System Drive
- Install Hadoop 3.2.0 on Windows 10 using Windows Subsystem for Linux (WSL)
I also recommend installing Hadoop 3.2.0 in WSL by following the second page.
After the above installation, your WSL should already have OpenJDK 1.8 installed.
Now let's install Apache Spark 2.4.3 in WSL.
Download binary package
Visit the Downloads page on the Spark website to find the download URL.
For me, the closest mirror is: http://apache.mirror.serversaustralia.com.au/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz.
Download the binary package using the following command:
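A minimal sketch of the download step, assuming wget is available in your WSL distribution and using the mirror URL quoted above. The variable names and the download_spark helper are illustrative and not part of Spark:

```shell
# Illustrative variables (assumptions): version, archive name and mirror URL.
SPARK_VERSION="2.4.3"
SPARK_PACKAGE="spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"
SPARK_URL="http://apache.mirror.serversaustralia.com.au/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}"

# Hypothetical helper: download the archive into the current folder,
# skipping the download if the file is already there.
download_spark() {
    [ -f "$SPARK_PACKAGE" ] || wget "$SPARK_URL"
}
```

Call download_spark from the folder where you want to keep the archive, and substitute your own mirror URL if the Downloads page lists a closer one.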
Unzip the binary package
Unpack the package into the ~/hadoop folder (create it first with mkdir -p ~/hadoop if it does not exist yet) using the following command:
tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz -C ~/hadoop
Set up environment variables
Set the SPARK_HOME environment variable and add its bin subfolder to the PATH variable.
Open the .bashrc file in your home folder with a text editor, for example vi ~/.bashrc.
Add the following lines to the end of the file:
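A minimal sketch of those lines, assuming the package was unpacked into ~/hadoop as in the tar command above; adjust the path if you extracted Spark elsewhere:

```shell
# Assumption: Spark was extracted to ~/hadoop/spark-2.4.3-bin-hadoop2.7
export SPARK_HOME="$HOME/hadoop/spark-2.4.3-bin-hadoop2.7"
# Put the Spark launcher scripts (spark-shell, run-example, ...) on the PATH
export PATH="$SPARK_HOME/bin:$PATH"
```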
Reload the modified file with source ~/.bashrc so the changes take effect in the current shell.
Spark is now set up.
Let's run a few tests.
Run Spark interactive shell
Run spark-shell to start the Spark interactive shell.
When the shell starts, the master is set to local[*], which means Spark runs locally with as many worker threads as there are logical cores.
Run built-in examples
Run the SparkPi example with the following command, where the argument 10 is the number of partitions used for the computation:
run-example SparkPi 10
This website provides many other Spark examples that you can use for practice.
Enable Hive support
If you’ve configured Hive in WSL, follow the steps below to enable Hive support in Spark.
Copy the Hadoop configuration files core-site.xml and hdfs-site.xml and the Hive configuration file hive-site.xml into the Spark configuration folder:
cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
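You can then run Spark with Hive support by calling enableHiveSupport on the session builder:
from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local[*]"

# Create a Spark session with Hive support enabled
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Read data using Spark
df = spark.sql("show databases")
df.show()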
For more details, please refer to this page: Read Data from Hive in Spark 1.x and 2.x.
Have fun with Spark in WSL!