Apache Spark 3.0.0 Installation on Linux Guide
- Prerequisites
  - Windows Subsystem for Linux (WSL)
  - Hadoop 3.3.0
  - OpenJDK 1.8
- Download binary package
- Unpack the binary package
- Setup environment variables
- Setup Spark default configurations
  - spark.eventLog.dir and spark.history.fs.logDirectory
- Run Spark interactive shell
- Run with built-in examples
- Spark context Web UI
- Enable Hive support
- Spark history server
This article provides a step-by-step guide to installing Apache Spark 3.0.0 on a UNIX-like system (Linux) or on Windows Subsystem for Linux (WSL). The instructions apply to Ubuntu, Debian, Red Hat, openSUSE, macOS, and similar systems.
Prerequisites
Windows Subsystem for Linux (WSL)
If you plan to configure Spark 3.0 on WSL, follow this guide to set up WSL on your Windows 10 machine:
Install Windows Subsystem for Linux on a Non-System Drive
Hadoop 3.3.0
This article uses the Spark package without pre-built Hadoop, so a Hadoop environment must be set up first.
If you choose to download a Spark package with pre-built Hadoop, the Hadoop 3.3.0 configuration is not required.
Follow a Hadoop 3.3.0 installation article to install Hadoop on your UNIX-like system first.
OpenJDK 1.8
JDK 1.8 needs to be available on your system.
The Hadoop installation articles include the steps to install OpenJDK.
Run the following command to verify the Java environment:
$ java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
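The same check can be scripted. The following is a minimal sketch in Python that extracts the major Java version from `java -version` output. The helper name `java_major_version` is made up for this illustration, and the parsing assumes the `version "..."` format shown above.

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        return None
    parts = match.group(1).split(".")
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_212).
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # Later releases put the major version first (for example 11.0.2).
    digits = re.match(r"\d+", parts[0])
    return int(digits.group()) if digits else None


if __name__ == "__main__" and shutil.which("java"):
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("Java major version:", java_major_version(result.stderr))
```

For the output shown above, the helper returns 8, which matches the JDK 1.8 prerequisite.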
Now let’s start configuring Apache Spark 3.0.0 on a UNIX-like system.
Download binary package
Visit the Downloads page on the Spark website to find the download URL.
For me, the closest location is: http://apache.mirror.amaze.com.au/spark/spark-3.0.0/spark-3.0.0-bin-without-hadoop.tgz.
Download the binary package using the following command:
wget http://apache.mirror.amaze.com.au/spark/spark-3.0.0/spark-3.0.0-bin-without-hadoop.tgz
Unpack the binary package
Unpack the package using the following command:
mkdir -p ~/hadoop/spark-3.0.0
tar -xvzf spark-3.0.0-bin-without-hadoop.tgz -C ~/hadoop/spark-3.0.0 --strip-components=1
The Spark binaries are unpacked to the folder ~/hadoop/spark-3.0.0.
Setup environment variables
Set up the SPARK_HOME environment variable and add its bin subfolder to the PATH variable. Because this Spark package ships without Hadoop, also set the Spark environment variable SPARK_DIST_CLASSPATH to the Hadoop Java classpath.
Run the following command to edit the .bashrc file:
vi ~/.bashrc
Add the following lines to the end of the file:
export SPARK_HOME=~/hadoop/spark-3.0.0
export PATH=$SPARK_HOME/bin:$PATH
# Configure Spark to use Hadoop classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Source the modified file to make it effective:
source ~/.bashrc
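To confirm that the three variables took effect in a new shell, a small check like the one below may help. It is only a sketch: the function name `check_spark_env` and the warning messages are made up for this illustration, and it inspects the environment without starting Spark.

```python
import os


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        spark_bin = os.path.join(spark_home, "bin")
        path_entries = env.get("PATH", "").split(os.pathsep)
        if spark_bin not in path_entries:
            problems.append(spark_bin + " is not on PATH")
    if not env.get("SPARK_DIST_CLASSPATH"):
        problems.append("SPARK_DIST_CLASSPATH is empty (did `hadoop classpath` run?)")
    return problems


if __name__ == "__main__":
    # Inspect the environment of the current shell session.
    for problem in check_spark_env(os.environ):
        print("WARNING:", problem)
```

If the script prints nothing, all three settings are visible to new processes. An empty SPARK_DIST_CLASSPATH usually means the `hadoop` command was not on PATH when .bashrc was sourced.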
Setup Spark default configurations
Run the following command to create a Spark default config file:
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
Edit the file to add configurations using the following command:
vi $SPARK_HOME/conf/spark-defaults.conf
Make sure you add the following line:
spark.driver.host localhost
Many other settings are available; configure them as necessary.
spark.eventLog.dir and spark.history.fs.logDirectory
These two settings can point to the same location or to different ones. Spark applications write their event logs to spark.eventLog.dir while they run, and the history server reads event logs from spark.history.fs.logDirectory.
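As an illustration, the sketch below prints three lines that could be appended to spark-defaults.conf so that applications and the history server share one directory. The helper name `event_log_settings` and the directory /tmp/spark-events are assumptions for this example. Note that event logging stays off unless spark.eventLog.enabled is set to true, and the directory should exist before an application starts.

```python
import os


def event_log_settings(log_dir):
    """Build spark-defaults.conf lines that write and read event logs in log_dir."""
    uri = "file://" + os.path.abspath(log_dir)
    return [
        "spark.eventLog.enabled true",
        "spark.eventLog.dir " + uri,
        "spark.history.fs.logDirectory " + uri,
    ]


if __name__ == "__main__":
    # /tmp/spark-events is an assumed location; any existing directory works.
    for line in event_log_settings("/tmp/spark-events"):
        print(line)
```

Using the same location for both settings is the simplest choice on a single machine, because the history server then sees every application without any copying.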
Now let's verify that the installation works.
Run Spark interactive shell
Run the following command to start Spark shell:
spark-shell
The interface looks like the following screenshot:
By default, the Spark master is set to local[*] in the shell.
Run with built-in examples
Run the SparkPi example with the following command:
run-example SparkPi 10
The output looks like the following:
This website provides many Spark examples; you can practice by following those guides.
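For intuition about what the SparkPi example computes: it estimates pi with a Monte Carlo method, and the argument 10 is the number of partitions the sampling is split into. The plain Python sketch below shows the same idea on a single machine without Spark. The function name `estimate_pi` and the sample count are chosen for this illustration only.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi by sampling random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count the points that land inside the unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers pi / 4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(200_000, seed=42))
```

Spark distributes the same loop across partitions and sums the hit counts, so more partitions mean more samples and a tighter estimate.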
Spark context Web UI
While a Spark session is running, you can view its details through the web UI. As printed in the interactive session window, the Spark context Web UI is available at http://localhost:4040. The URL is based on the Spark default configurations. The port number changes if the default port is already in use.
The following is a screenshot of the UI:
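When port 4040 is already taken, for example by another Spark session, Spark tries the following ports in turn (4041, 4042, and so on) up to a retry limit controlled by spark.port.maxRetries, which defaults to 16. The sketch below imitates that search from outside Spark to predict which port a new session would probably get. The function name `first_free_port` is hypothetical, and the probe only approximates what Spark does internally.

```python
import socket


def first_free_port(start=4040, max_retries=16, host="127.0.0.1"):
    """Return the first port in start..start+max_retries that can be bound."""
    for port in range(start, start + max_retries + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # Port is taken; try the next one.
            return port
    return None  # Every port in the range is busy.


if __name__ == "__main__":
    print("A new Spark UI would probably use port", first_free_port())
```

The port actually chosen is always printed in the session log, so treat the result of this probe as a hint only.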
Enable Hive support
If you’ve configured Hive in WSL, follow the steps below to enable Hive support in Spark.
Copy the Hadoop core-site.xml and hdfs-site.xml and Hive hive-site.xml configuration files into Spark configuration folder:
cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
And then you can run Spark with Hive support (enableHiveSupport function):
from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local[*]"

spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# Read data using Spark
df = spark.sql("show databases")
df.show()
For more details, please refer to this page: Read Data from Hive in Spark 1.x and 2.x.
Spark history server
Run the following command to start Spark history server:
$SPARK_HOME/sbin/start-history-server.sh
Open the history server UI (by default: http://localhost:18080/) in a browser. You should be able to view all the jobs submitted.
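The history server also exposes the application list as JSON under /api/v1/applications, which is handy for a quick check from a script. The following is a minimal sketch; the function names are made up for this illustration, and the base URL assumes the default port 18080 on the local machine.

```python
import json
import urllib.request


def list_applications(payload):
    """Return (id, name) pairs from the JSON text of /api/v1/applications."""
    return [(app["id"], app["name"]) for app in json.loads(payload)]


def fetch_applications(base_url="http://localhost:18080"):
    """Download the application list from a running history server."""
    url = base_url + "/api/v1/applications"
    with urllib.request.urlopen(url, timeout=5) as response:
        return list_applications(response.read().decode("utf-8"))
```

Call `fetch_applications()` while the history server is running. An empty list usually means no event logs have been written to the directory the server reads from.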
When I launch spark-shell, I get this error:
Spark session available as 'spark'.
Exception in thread "main" java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:143)
at scala.tools.nsc.interpreter.jline.InteractiveReader.postInit(JLineReader.scala:58)
at org.apache.spark.repl.SparkILoop.$anonfun$process$3(SparkILoop.scala:144)
at org.apache.spark.repl.SparkILoop.$anonfun$process$3$adapted(SparkILoop.scala:142)
at scala.tools.nsc.interpreter.SplashReader.postInit(InteractiveReader.scala:142)
at org.apache.spark.repl.SparkILoop.$anonfun$process$4(SparkILoop.scala:168)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.tools.nsc.interpreter.ILoop.$anonfun$mumly$1(ILoop.scala:166)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206)
at scala.tools.nsc.interpreter.ILoop.mumly(ILoop.scala:163)
at org.apache.spark.repl.SparkILoop.loopPostInit$1(SparkILoop.scala:153)
at org.apache.spark.repl.SparkILoop.$anonfun$process$10(SparkILoop.scala:221)
at org.apache.spark.repl.SparkILoop.withSuppressedSettings$1(SparkILoop.scala:189)
at org.apache.spark.repl.SparkILoop.startup$1(SparkILoop.scala:201)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:236)
at org.apache.spark.repl.Main$.doMain(Main.scala:78)
at org.apache.spark.repl.Main$.main(Main.scala:58)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-12 15:50:15,272 INFO spark.SparkContext: Invoking stop() from shutdown hook
2024-06-12 15:50:15,273 INFO spark.SparkContext: SparkContext is stopping with exitCode 0.
2024-06-12 15:50:15,298 INFO server.AbstractConnector: Stopped Spark@3e42e286{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2024-06-12 15:50:15,302 INFO ui.SparkUI: Stopped Spark web UI at http://zookeeper3:4040
2024-06-12 15:50:15,335 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2024-06-12 15:50:15,359 INFO memory.MemoryStore: MemoryStore cleared
2024-06-12 15:50:15,360 INFO storage.BlockManager: BlockManager stopped
2024-06-12 15:50:15,376 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2024-06-12 15:50:15,380 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2024-06-12 15:50:15,402 INFO spark.SparkContext: Successfully stopped SparkContext
2024-06-12 15:50:15,402 INFO util.ShutdownHookManager: Shutdown hook called
2024-06-12 15:50:15,403 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-161e2c67-9bb8-4e6f-beca-cc98e284b40f
2024-06-12 15:50:15,414 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-f60046ba-96e4-48c9-b25b-bbe5ba6a971a/repl-41e59a2c-a7af-46c7-b6fd-c374b4cc7c87