This page summarizes the steps to install Apache Spark 2.4.3 (the latest version at the time of writing) on Windows 10 via Windows Subsystem for Linux (WSL).

Prerequisites

Follow either of the following pages to install WSL on a system or non-system drive on Windows 10.

I also recommend installing Hadoop 3.2.0 in WSL by following the second page.

After the above installation, your WSL should already have OpenJDK 1.8 installed.

Now let’s install Apache Spark 2.4.3 in WSL.

Download binary package

Visit the Downloads page on the Spark website to find the download URL.

[Screenshot: Spark Downloads page]

For me, the closest location is: http://apache.mirror.serversaustralia.com.au/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz.

Download the binary package using the following command:

wget http://apache.mirror.serversaustralia.com.au/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

Unzip the binary package

Unpack the package into the ~/hadoop folder (the first command creates the folder if it does not already exist):

mkdir -p ~/hadoop
tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz -C ~/hadoop

Set up environment variables

Set up the SPARK_HOME environment variable and add the bin subfolder to the PATH variable.

Run the following command to edit the .bashrc file:

vi ~/.bashrc

Add the following lines to the end of the file:

export SPARK_HOME=~/hadoop/spark-2.4.3-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Source the modified file to make it effective:

source ~/.bashrc
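To confirm that the two variables took effect, a small check like the following can help. This is only a sketch: check_spark_env is a hypothetical helper name, not part of Spark, and it assumes Linux-style paths as used inside WSL.

```python
import os
import posixpath


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    # WSL is a Linux environment, so PATH entries are separated by ":".
    expected_bin = posixpath.join(spark_home, "bin")
    path_entries = env.get("PATH", "").split(":")
    if expected_bin not in path_entries:
        problems.append(expected_bin + " is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_spark_env(os.environ)
    print("Spark environment looks fine" if not issues else "\n".join(issues))
```

Run it with python3 in a new WSL terminal after sourcing .bashrc. An empty problem list means SPARK_HOME is set and its bin subfolder is on PATH.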

Spark is now set up correctly.

Let’s run some tests.

Run Spark interactive shell

Run the following command to start Spark shell:

spark-shell

The interface looks like the following screenshot:

[Screenshot: Spark shell after startup]

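The master is set to local[*], which means Spark runs locally with as many worker threads as there are logical cores on the machine.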
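As a rough illustration of how the simple local master strings map to thread counts, here is a plain Python sketch. The function name local_thread_count is hypothetical, the code does not call Spark, and it covers only the local, local[N] and local[*] forms.

```python
import os
import re


def local_thread_count(master):
    """Return the worker-thread count implied by a simple local master string.

    Returns None when the string is not one of the forms handled here.
    """
    if master == "local":
        # Plain "local" runs with a single worker thread.
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    value = match.group(1)
    if value == "*":
        # "*" stands for the number of logical cores on the machine.
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for master in ("local", "local[4]", "local[*]"):
        print(master, "->", local_thread_count(master))
```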

Run built-in examples

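Run the Spark Pi example with the following command. Run it directly in the WSL terminal (bash), not inside the Spark shell, because run-example is a script in the $SPARK_HOME/bin folder: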

run-example SparkPi 10
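SparkPi estimates the value of pi by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle. The sketch below shows the same idea in plain Python on a single machine. It is not the Spark example's code, and estimate_pi is a hypothetical name used only for illustration.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi from the fraction of random points that land in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Pick a point uniformly in the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(1000000))
```

Spark distributes this kind of sampling across partitions, and the argument 10 in the command above sets how many partitions the example uses.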

On this website, I’ve provided many Spark examples. You can practice by following those guides.

Enable Hive support

If you’ve configured Hive in WSL, follow the steps below to enable Hive support in Spark.

Copy the Hadoop core-site.xml and hdfs-site.xml files and the Hive hive-site.xml file into the Spark configuration folder:

cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
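If you want to verify that the three files are in place before starting Spark, a small check along these lines may be useful. It is a sketch only: missing_config_files is a hypothetical helper, and it looks at file names without validating their contents.

```python
import os

# The three files copied in the step above.
REQUIRED_FILES = ("core-site.xml", "hdfs-site.xml", "hive-site.xml")


def missing_config_files(conf_dir):
    """Return the required file names that are not present in conf_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME", "")
    missing = missing_config_files(os.path.join(spark_home, "conf"))
    print("All files present" if not missing else "Missing: " + ", ".join(missing))
```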

And then you can run Spark with Hive support (enableHiveSupport function):

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local[*]"

spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Read data using Spark
df = spark.sql("show databases")
df.show()

For more details, please refer to this page: Read Data from Hive in Spark 1.x and 2.x.

Spark default configurations

Run the following command to create a Spark default configuration file from the template in the $SPARK_HOME/conf folder:

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Update the config file with your default Spark configurations. These settings are applied by default when Spark jobs are submitted.

In the following configuration, I enabled event logging and set the event log directory and the Spark history server log directory.

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://localhost:19000/spark-event-logs
spark.history.fs.logDirectory    hdfs://localhost:19000/spark-event-logs
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

Spark history server

Run the following command to start Spark history server:

$SPARK_HOME/sbin/start-history-server.sh

Open the history server UI (by default http://localhost:18080/) in a browser. You should be able to view all the submitted jobs.
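Besides the web UI, the history server also exposes a REST API under /api/v1. The sketch below lists the recorded applications from it. The helper names are hypothetical, the host and port are the defaults mentioned above, and fetch_applications only works while the history server is running.

```python
import json
from urllib.request import urlopen


def applications_url(host="localhost", port=18080):
    """Build the URL of the history server's application list endpoint."""
    return "http://{}:{}/api/v1/applications".format(host, port)


def list_applications(payload):
    """Return (id, name) pairs from the JSON text returned by the endpoint."""
    return [(app["id"], app["name"]) for app in json.loads(payload)]


def fetch_applications(host="localhost", port=18080, timeout=5):
    """Query a running history server and return (id, name) pairs."""
    with urlopen(applications_url(host, port), timeout=timeout) as response:
        return list_applications(response.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        for app_id, name in fetch_applications():
            print(app_id, name)
    except OSError as error:
        print("History server not reachable:", error)
```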

spark.eventLog.dir and spark.history.fs.logDirectory

These two settings can point to the same location or to different ones. Spark applications write event logs to spark.eventLog.dir while they run, and the history server reads event logs from spark.history.fs.logDirectory.
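To see at a glance whether both settings point to the same place, you could parse the config file with a few lines of Python. This is a simplified sketch: the function names are hypothetical, and the parser handles only the whitespace-separated "key value" layout used in the file above.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict, skipping comments and blanks."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: key on the left, value on the right.
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def same_event_log_location(settings):
    """Return True when the write and read directories are both set and identical."""
    write_dir = settings.get("spark.eventLog.dir")
    read_dir = settings.get("spark.history.fs.logDirectory")
    return write_dir is not None and write_dir == read_dir


if __name__ == "__main__":
    sample = (
        "spark.eventLog.enabled           true\n"
        "spark.eventLog.dir               hdfs://localhost:19000/spark-event-logs\n"
        "spark.history.fs.logDirectory    hdfs://localhost:19000/spark-event-logs\n"
    )
    print(same_event_log_location(parse_spark_defaults(sample)))
```

When the two directories differ, the history server only shows applications whose event logs end up in the directory it reads from.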

Have fun with Spark in WSL!

Last modified by Raymond 5 months ago.

Comments (4)

Raymond:

@Jonathan

I’m glad to hear that it’s now working for you.

Quoted comment from Jonathan (2 years ago), Re: Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux:

Hi,

Thanks for pointing out the error. I did as you suggested and it works now! Great post on setting up Apache and Hadoop in WSL!

Jonathan

Raymond:

Hi,

run-example is a command, not a Scala function. The script file is in the $SPARK_HOME/bin folder, so please run it directly in bash (the WSL terminal) instead of in the Spark shell.

Let me know if you have other questions.

Quoted comment from Jonathan (2 years ago), Re: Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux:

I followed your instructions and installed Scala after installing Hadoop. But when I try to run the SparkPi example, I get the following:

scala> run-example SparkPi 10

<console>:24: error: not found: value run

       run-example SparkPi 10

       ^

<console>:24: error: not found: value example

       run-example SparkPi 10

Not sure what the error is. Thanks.

Sincerely

Jonathan
