Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux

Raymond Tang, about 4 months ago

This page summarizes the steps to install Apache Spark 2.4.3, the latest version at the time of writing, on Windows 10 via Windows Subsystem for Linux (WSL).

Prerequisites

Follow either of the following pages to install WSL on a system or non-system drive in Windows 10.

I also recommend installing Hadoop 3.2.0 in WSL by following the second page.

After the above installation, your WSL should already have OpenJDK 1.8 installed.

Now let’s start to install Apache Spark 2.4.3 in WSL.

Download binary package

Visit the Downloads page on the Spark website to find the download URL.

For me, the closest location is: http://apache.mirror.serversaustralia.com.au/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz.

Download the binary package using the following command:

wget http://apache.mirror.serversaustralia.com.au/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
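Before unpacking, it can be worth checking that the archive downloaded intact. The following is a minimal sketch in Python, assuming the archive sits in the current directory. The function name sha512_of is illustrative, and comparing the printed digest with the checksum linked from the Spark Downloads page is left to you.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so a large archive is never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # File name taken from the wget command; adjust it if you used another mirror.
    package = Path("spark-2.4.3-bin-hadoop2.7.tgz")
    if package.exists():
        print(sha512_of(package))
    else:
        print(f"{package} not found in the current directory")
```

Run it with python3 from the folder that holds the downloaded file.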

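Unpack the binary package

Unpack the package into the ~/hadoop folder using the following command (create the folder first with mkdir -p ~/hadoop if it does not exist yet):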

tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz -C ~/hadoop

Set up environment variables

Set the SPARK_HOME environment variable and add its bin subfolder to the PATH variable.

Run the following command to edit the .bashrc file:

vi ~/.bashrc

Add the following lines to the end of the file:

export SPARK_HOME=~/hadoop/spark-2.4.3-bin-hadoop2.7

export PATH=$SPARK_HOME/bin:$PATH

Source the modified file to apply the changes:

source ~/.bashrc
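To confirm that the variables took effect, a quick check along these lines may help. It is only a sketch: check_spark_env is a hypothetical helper, and it assumes that spark-shell lives in the bin subfolder of SPARK_HOME, as in the binary package used here.

```python
import os
from pathlib import Path


def check_spark_env(environ):
    """Return a list of problems found in the given environment mapping."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    bin_dir = Path(spark_home).expanduser() / "bin"
    if not (bin_dir / "spark-shell").is_file():
        problems.append(f"spark-shell not found in {bin_dir}")

    # Split PATH into its entries and look for the Spark bin folder.
    path_entries = [
        Path(entry).expanduser()
        for entry in environ.get("PATH", "").split(os.pathsep)
        if entry
    ]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_spark_env(os.environ)
    print("OK" if not issues else "\n".join(issues))
```

Run it with python3 in the same terminal session in which you sourced .bashrc. An empty problem list is reported as OK.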

Spark is now set up.

Let’s run some tests.

Run Spark interactive shell

Run the following command to start Spark shell:

spark-shell

The interface looks like the following screenshot:

[Screenshot: Spark shell startup output]

The master is set to local[*], which runs Spark locally with as many worker threads as there are logical cores on the machine.
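The meaning of the local master URL can be illustrated with a small sketch. This is not Spark's own parsing code: local_thread_count is a hypothetical helper that follows the documented meaning of local, local[K] and local[*], and it ignores other forms such as local[K,F]. The core count that Python reports may also differ from the one the JVM sees.

```python
import os
import re


def local_thread_count(master):
    """Return the worker-thread count implied by a simple local master URL."""
    if master == "local":
        return 1  # plain "local" means a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a simple local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # "*" stands for the number of logical cores on this machine.
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for url in ("local", "local[4]", "local[*]"):
        print(url, "->", local_thread_count(url))
```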

Run built-in examples

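Run the SparkPi example with the following command. run-example is a script in the $SPARK_HOME/bin folder, so run it directly in bash (the WSL terminal), not inside the Spark shell: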

run-example SparkPi 10
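The argument 10 sets the number of slices (partitions) the work is split into. To show what the example computes, here is a plain Python sketch of the same Monte Carlo idea. It runs on a single thread without Spark, and the function name, parameter names and defaults are illustrative rather than taken from the bundled example.

```python
import random


def estimate_pi(slices=10, samples_per_slice=100_000, seed=None):
    """Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    total = slices * samples_per_slice
    inside = 0
    for _ in range(total):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x * x + y * y <= 1:
            inside += 1  # the point fell inside the unit circle
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * inside / total


if __name__ == "__main__":
    print(f"Pi is roughly {estimate_pi(slices=10)}")
```

More slices mean more samples here, so the estimate tends to get closer to Pi. In Spark, the slices are also the units of work that are processed in parallel.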

This website provides many other Spark examples that you can follow for practice.

Enable Hive support

If you’ve configured Hive in WSL, follow the steps below to enable Hive support in Spark.

Copy the Hadoop configuration files core-site.xml and hdfs-site.xml and the Hive configuration file hive-site.xml into the Spark configuration folder:

cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/

cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
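If one of the environment variables in these commands is empty, a copy can fail without being noticed. A short sketch for checking the result, assuming SPARK_HOME is set as described earlier (missing_config_files is a hypothetical helper name):

```python
import os
from pathlib import Path

# File names taken from the cp commands above.
REQUIRED_FILES = ("core-site.xml", "hdfs-site.xml", "hive-site.xml")


def missing_config_files(conf_dir):
    """Return the required file names that are absent from conf_dir."""
    conf_path = Path(conf_dir)
    return [name for name in REQUIRED_FILES if not (conf_path / name).is_file()]


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        print("SPARK_HOME is not set")
    else:
        missing = missing_config_files(Path(spark_home) / "conf")
        print("All files copied" if not missing else f"Missing: {missing}")
```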

You can then run Spark with Hive support by calling enableHiveSupport() on the session builder:

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local[*]"
spark = SparkSession.builder \
             .appName(appName) \
             .master(master) \
             .enableHiveSupport() \
             .getOrCreate()

# Read data using Spark
df = spark.sql("show databases")
df.show()

For more details, please refer to this page: Read Data from Hive in Spark 1.x and 2.x.

Have fun with Spark in WSL!

Comments (4)

Raym*** about 3 months ago

@Jonathan

I’m glad to hear that it’s now working for you.

Jo*** about 3 months ago

Hi,

Thanks for pointing out the error. I did as you suggested and it works now! Great post on setting up Apache and Hadoop in WSL!

Jonathan

Raym*** about 4 months ago

Hi,

run-example is a command, not a Scala function. The script file is in the $SPARK_HOME/bin folder, so please run it directly in bash (the WSL terminal) instead of in the Spark shell.

Let me know if you have other questions.

Jo*** about 4 months ago

I followed your instructions and installed Scala after installing Hadoop. But when I try to run the SparkPi example, I get the following:

scala> run-example SparkPi 10

<console>:24: error: not found: value run

       run-example SparkPi 10

       ^

<console>:24: error: not found: value example

       run-example SparkPi 10

Not sure what the error is. Thanks.

Sincerely

Jonathan
