Install Apache Spark 3.0.0 on Windows 10
*Spark logo is a registered trademark of Apache Spark.
- Tools and Environment
- Install Git Bash
- Install Java JDK
- Install Python
- Hadoop installation (optional)
- Download binary package
- Unpack binary package
- Setup environment variables
- Config Spark default variables
- Verify the installation
- Verify spark-shell command
- Run examples
- PySpark interactive window
- Try Spark SQL
- Spark context UI
- References
- Spark developer tools
- Spark 3.0.0 overview
- Spark 3.0.0 release notes
Spark 3.0.0 was released on 18 June 2020 with many new features. The highlights include adaptive query execution, dynamic partition pruning, ANSI SQL compliance, significant improvements in pandas APIs, a new UI for structured streaming, up to 40x speedups for calling R user-defined functions, an accelerator-aware scheduler and SQL reference documentation.
This article summarizes the steps to install Spark 3.0 on your Windows 10 environment.
Tools and Environment
- Git Bash
- Command Prompt
- Windows 10
- Python
- Java JDK
Install Git Bash
Download the latest Git Bash tool from this page: https://git-scm.com/downloads.
Run the installation wizard to complete the installation.
Install Java JDK
Spark 3.0 runs on Java 8 or 11. If you need to install JDK 8, follow the steps in this section:
Step 4 - (Optional) Java JDK installation
If Java 8 or 11 is already available on your system, you don't need to install it again.
Install Python
Python is required for using PySpark. Follow these steps to install Python.
1) Download and install Python from this web page: https://www.python.org/downloads/.
2) Verify installation by running the following command in Command Prompt or PowerShell:
python --version
The output looks like the following:
If the python command cannot be invoked directly, check the PATH environment variable to make sure the Python installation path has been added.
For example, in my environment Python is installed at C:\Users\Raymond\AppData\Local\Programs\Python\Python38-32, so that path is added to the PATH variable.
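If you are not sure which folder needs to be on PATH, you can ask Python itself where it is installed. The following is only a minimal sketch using the standard library; the helper name python_dir_on_path is made up for this illustration.

```python
import os
import sys


def python_dir_on_path(path_value, python_dir):
    """Return True if python_dir is one of the entries in a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(python_dir))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == wanted for entry in entries)


if __name__ == "__main__":
    # Folder that contains the running interpreter (python.exe on Windows).
    install_dir = os.path.dirname(sys.executable)
    print("Python version:", sys.version.split()[0])
    print("Installed at:", install_dir)
    print("On PATH:", python_dir_on_path(os.environ.get("PATH", ""), install_dir))
```

If the script has to be started with the full path to python.exe and reports that the folder is not on PATH, add the printed folder to the PATH variable and open a new terminal window.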
Hadoop installation (optional)
To work with Hadoop, you can configure a Hadoop single node cluster following this article:
Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
Download binary package
Go to the following site:
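Select the package type accordingly. I already have Hadoop 3.3.0 installed on my system, so I selected the package without bundled Hadoop (the file used in the following steps is spark-3.0.0-bin-without-hadoop.tgz).
Alternatively, you can choose the package pre-built for Hadoop 3.2 or later, which already includes the Hadoop libraries.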
Save the latest binary to your local drive. In my case, I am saving the file to folder: F:\big-data. If you are saving the file into a different location, remember to change the path in the following steps accordingly.
Unpack binary package
Open Git Bash, change directory (cd) to the folder where you saved the binary package, and then unpack it using the following commands:
$ mkdir spark-3.0.0
$ tar -C spark-3.0.0 -xvzf spark-3.0.0-bin-without-hadoop.tgz --strip 1
Spark 3.0 files are now extracted to F:\big-data\spark-3.0.0.
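The --strip 1 option (GNU tar accepts it as an abbreviation of --strip-components=1) drops the top-level folder stored inside the archive, so the files land directly in spark-3.0.0. If tar is not available, roughly the same result can be achieved with Python's standard tarfile module. This is only a sketch: the function name extract_stripped is hypothetical, it does not handle hard links inside the archive, and it assumes the archive comes from a trusted source.

```python
import os
import tarfile


def extract_stripped(archive_path, dest_dir, strip=1):
    """Extract a .tgz archive into dest_dir, dropping the first `strip` path components."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = []
        for member in tar.getmembers():
            parts = member.name.split("/")
            if len(parts) <= strip:
                continue  # the top-level folder itself; nothing is left after stripping
            member.name = "/".join(parts[strip:])
            members.append(member)
        # Only use this on archives you trust, such as the official Spark download.
        tar.extractall(dest_dir, members=members)


# Example call, using the file and folder names from the tar command above:
# extract_stripped("spark-3.0.0-bin-without-hadoop.tgz", "spark-3.0.0")
```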
Setup environment variables
1) Set up the JAVA_HOME variable.
Set up the environment variable JAVA_HOME if it has not been done yet. The variable value points to your Java JDK location.
2) Set up the SPARK_HOME variable.
Set up the SPARK_HOME environment variable with the path of your Spark installation directory (F:\big-data\spark-3.0.0 in this article).
3) Update the PATH variable.
Add %SPARK_HOME%\bin to your PATH environment variable.
4) Configure Spark variable SPARK_DIST_CLASSPATH.
This is only required if you configure Spark with an existing Hadoop. If your package type already includes pre-built Hadoop libraries, you don't need to do this.
Run the following command in Command Prompt to find the existing Hadoop classpath:
F:\big-data>hadoop classpath
F:\big-data\hadoop-3.3.0\etc\hadoop;F:\big-data\hadoop-3.3.0\share\hadoop\common;F:\big-data\hadoop-3.3.0\share\hadoop\common\lib\*;F:\big-data\hadoop-3.3.0\share\hadoop\common\*;F:\big-data\hadoop-3.3.0\share\hadoop\hdfs;F:\big-data\hadoop-3.3.0\share\hadoop\hdfs\lib\*;F:\big-data\hadoop-3.3.0\share\hadoop\hdfs\*;F:\big-data\hadoop-3.3.0\share\hadoop\yarn;F:\big-data\hadoop-3.3.0\share\hadoop\yarn\lib\*;F:\big-data\hadoop-3.3.0\share\hadoop\yarn\*;F:\big-data\hadoop-3.3.0\share\hadoop\mapreduce\*
Set up the environment variable SPARK_DIST_CLASSPATH with this output as its value.
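Before moving on, it can help to double-check the variables from steps 1 to 3 in a newly opened terminal. The following is a minimal sketch, not an official Spark tool; the function name check_spark_env is made up for this illustration, and it only checks that the variables exist and that the bin folder appears on PATH.

```python
import ntpath
import os


def check_spark_env(env):
    """Return a list of problems found in a mapping of Windows environment variables."""
    def norm(path):
        # Windows paths are case-insensitive; also ignore trailing slashes.
        return ntpath.normcase(ntpath.normpath(path))

    problems = [name + " is not set" for name in ("JAVA_HOME", "SPARK_HOME") if not env.get(name)]
    spark_home = env.get("SPARK_HOME")
    if spark_home:
        spark_bin = norm(ntpath.join(spark_home, "bin"))
        entries = [norm(entry) for entry in env.get("PATH", "").split(";") if entry]
        if spark_bin not in entries:
            problems.append("%SPARK_HOME%\\bin is not on PATH")
    return problems


if __name__ == "__main__":
    for message in check_spark_env(os.environ) or ["JAVA_HOME, SPARK_HOME and PATH look fine"]:
        print(message)
```

Run it from a terminal opened after the variables were saved, because terminals opened earlier do not see the new values.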
Config Spark default variables
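Run the following command in Command Prompt to create a default configuration file. The cp command is only available there if the Git Bash bin folder (for example, C:\Program Files\Git\usr\bin) has been added to PATH; in Git Bash itself, use $SPARK_HOME instead of %SPARK_HOME%.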
cp %SPARK_HOME%/conf/spark-defaults.conf.template %SPARK_HOME%/conf/spark-defaults.conf
Open spark-defaults.conf file and add the following entries:
spark.driver.host localhost
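If the cp command is not recognized in your terminal, the copy and the edit can also be done with a few lines of Python. This is a sketch under the assumption that SPARK_HOME points to the installation folder; the function name create_spark_defaults is hypothetical.

```python
import os
import shutil


def create_spark_defaults(spark_home, entries):
    """Copy spark-defaults.conf.template to spark-defaults.conf and append missing entries."""
    conf_dir = os.path.join(spark_home, "conf")
    template = os.path.join(conf_dir, "spark-defaults.conf.template")
    target = os.path.join(conf_dir, "spark-defaults.conf")
    if not os.path.exists(target):
        shutil.copyfile(template, target)  # same effect as the cp command
    with open(target, encoding="utf-8") as conf:
        lines = conf.read().splitlines()
    # Keys that already have an uncommented entry are left untouched.
    configured = {line.split()[0] for line in lines
                  if line.strip() and not line.lstrip().startswith("#")}
    missing = [key + " " + value for key, value in entries.items() if key not in configured]
    if missing:
        with open(target, "w", encoding="utf-8") as conf:
            conf.write("\n".join(lines + missing) + "\n")
    return target


# Example call (requires SPARK_HOME to be set):
# create_spark_defaults(os.environ["SPARK_HOME"], {"spark.driver.host": "localhost"})
```

Running it a second time does not add the entry again, so it is safe to repeat.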
Now Spark is available to use.
Verify the installation
Let's run some verification to ensure the installation is completed without errors.
Verify spark-shell command
Run the following command in Command Prompt to verify the installation.
spark-shell
The screen should be similar to the following screenshot:
You can use Scala in this interactive window.
Run examples
Execute the following command in Command Prompt to run one of the examples provided as part of the Spark installation (class SparkPi with parameter 10).
Documentation: https://spark.apache.org/docs/latest/
%SPARK_HOME%\bin\run-example.cmd SparkPi 10
The output looks like the following:
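To give an idea of what the SparkPi example computes: as far as I understand it, it estimates Pi by sampling random points in a square and counting how many fall inside the unit circle, and the parameter 10 is the number of slices (partitions) the work is split into. The following plain Python sketch shows the same idea on a single machine without Spark; it is not Spark's code, and the function name and the samples_per_slice default are assumptions for illustration.

```python
import random


def estimate_pi(slices=10, samples_per_slice=100_000, seed=None):
    """Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    total = slices * samples_per_slice
    inside = 0
    for _ in range(total):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1:  # the point falls inside the unit circle
            inside += 1
    # Ratio of circle area to square area is Pi / 4.
    return 4.0 * inside / total


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(slices=10))
```

In Spark the samples are distributed across the slices and the counts are summed at the end, which is why the result varies slightly between runs.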
PySpark interactive window
Run the following command to try PySpark:
pyspark
Python in my environment is 3.8.2.
Try Spark SQL
Spark SQL interactive window can be run through this command:
spark-sql
As I have not configured Hive on my system, an error occurs when I run the above command.
Spark context UI
When a Spark session is running, you can view its details through the web UI. As printed in the interactive session window, the Spark context web UI is available at http://localhost:4040. The URL is based on the Spark default configuration. The port number can change if the default port is already in use.
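When port 4040 is taken, for example by another running Spark session, Spark moves on to the next port (4041, 4042, and so on). If you are not sure which port your session ended up on, a small script can probe the candidates. This is a minimal sketch using only the standard library; the function name find_spark_ui_ports and the number of attempts are assumptions, and an open port only shows that something is listening there, not that it is Spark.

```python
import socket


def find_spark_ui_ports(host="localhost", first_port=4040, attempts=5, timeout=0.5):
    """Return the ports in [first_port, first_port + attempts) that accept TCP connections."""
    open_ports = []
    for port in range(first_port, first_port + attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass  # nothing is listening on this port
    return open_ports


if __name__ == "__main__":
    for port in find_spark_ui_ports():
        print("Something is listening on http://localhost:%d" % port)
```

The port that Spark actually uses is also printed in the interactive session window when the session starts.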
The following is a screenshot of the UI:
References
Spark developer tools
Refer to the following page if you are interested in any Spark developer tools.
https://spark.apache.org/developer-tools.html
Spark 3.0.0 overview
Refer to the official documentation about Spark 3.0.0 overview: http://spark.apache.org/docs/3.0.0/.
Spark 3.0.0 release notes
https://spark.apache.org/releases/spark-release-3-0-0.html
Orland, 3 years ago
No, it didn't. BTW Raymond, I managed to run Hive smoothly the other day after installation and was able to access HiveServer2. Now when I try to connect, I can access Hive, but hive --help doesn't work and I can't connect to HiveServer2 either when I run these commands:
HIVE_HOME/bin/hive --service metastore &
$HIVE_HOME/bin/hive --service hiveserver2 start &
Also, I don't have a hive folder with a warehouse subfolder (/user/hive/warehouse) in my user directory.
Raymond, 3 years ago
Did your Spark session crash after you saw the warning message?
WARN executor.ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped.
If it doesn't crash, it is OK. There was a recommendation to create a PYSPARK_PYTHON environment variable that points to the Python executable on your machine. However, since you are using spark-shell (Scala), I don't think Python matters.
Can you run the following command in Command Prompt:
getconf PAGESIZE
You should be able to see something like the following screenshot.
The getconf command is provided by my Git Bash:
So if you cannot run it successfully, it suggests you have not added the Git Bash bin folder to the environment variable PATH correctly. Please do that as I suggested in the preceding comment.
Orland, 3 years ago
This is the warning. Thank you.
Sorry, I forgot to mention that %SPARK_HOME% works with Command Prompt.
For Git Bash, please use $SPARK_HOME to access the environment variable.
To add the Git Bash bin folder to the PATH variable, add the path C:\Program Files\Git\usr\bin to the environment variable PATH. Depending on where Git is installed on your computer, change the path accordingly. You can also go directly to the Spark installation folder and copy the file manually instead of using the command.
Orland, 3 years ago
How do I set up Git Bash in PATH?
Raymond, 3 years ago
Hi Orland,
For copying the file, have you opened a new window after setting the environment variable SPARK_HOME? A terminal opened before you set the variable won't pick it up. Also, the cp command only exists in PowerShell, Git Bash, or Command Prompt (when you have added the Git Bash bin folder to PATH).
The error you got is actually a warning message, and I think you can just ignore it. Let me know if your whole process fails because of it. BTW, it would be helpful if you provided a screenshot so that I can view all the error messages.
Orland, 3 years ago
Hi Raymond, I'm getting this error when I run Spark and PySpark in Command Prompt. How do I fix it?
WARN executor.ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped.
I also tried typing 'cp %SPARK_HOME%/conf/spark-defaults.conf.template %SPARK_HOME%/conf/spark-defaults.conf' in Command Prompt and Git Bash, but it wasn't recognized.
Thank you.
If you use Derby for the Hive metastore, please make sure the current directory in your Command Prompt is the same as when you ran the init command previously; otherwise you will have to initialize the metastore again. I suspect the error you got was caused by that, but I will need to look into the details to be able to tell.
The data warehouse folder exists in HDFS, not directly in the file system.