Install Hadoop 3.3.0 on Linux
This article provides step-by-step guidance for installing Hadoop 3.3.0 on Linux distributions such as Debian, Ubuntu, Red Hat, and openSUSE. Hadoop 3.3.0 was released on July 14, 2020, and is the first release in the Apache Hadoop 3.3 line. It includes significant changes compared with Hadoop 3.2.0, such as Java 11 runtime support, a protobuf upgrade to 3.7.1, scheduling of opportunistic containers, and non-volatile SCM support in HDFS cache directives.
Install Java JDK
Run the following command to update package index:
sudo apt update
Check whether Java is installed already:
java -version

Command 'java' not found, but can be installed with:

sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless
Install OpenJDK via the following command:
sudo apt-get install openjdk-8-jdk
Check the version installed:
java -version

openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
You can also use Java 11, as the Java 11 runtime is now supported in Hadoop 3.3.0.
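If you are not sure which JDK path to use later for JAVA_HOME, the following sketch (assuming OpenJDK was installed through apt as above) resolves the installation directory from the java binary:

# Resolve the real path of the java binary; JAVA_HOME is the JDK directory above bin (or jre/bin)
readlink -f "$(which java)"
# Example output: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java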
Download Hadoop binary
Go to the release page of the Hadoop website to find a download URL for Hadoop 3.3.0:
For me, the closest mirror is:
http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Run the following command in Ubuntu terminal to download a binary from the internet:
wget http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Wait until the download is completed:
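Optionally, verify the integrity of the downloaded archive against the published checksum. This is a sketch; the checksum URL below (on downloads.apache.org) is an assumption and may need to be adjusted to match the mirror or archive you used:

# Download the published SHA-512 checksum file (adjust the URL if needed)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz.sha512
# Compute the local checksum and compare it with the contents of the .sha512 file
sha512sum hadoop-3.3.0.tar.gz
cat hadoop-3.3.0.tar.gz.sha512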
Unzip Hadoop binary
Run the following command to create a hadoop folder under user home folder:
mkdir ~/hadoop
And then run the following command to unzip the binary package:
tar -xvzf hadoop-3.3.0.tar.gz -C ~/hadoop
Once it is unpacked, change the current directory to the Hadoop folder:
cd ~/hadoop/hadoop-3.3.0/
Configure passphraseless ssh
This step is critical; please make sure you follow it carefully.
Make sure you can SSH to localhost in Ubuntu:
ssh localhost
If you cannot ssh to localhost without a passphrase, run the following command to initialize your private and public keys:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
If you encounter errors like ‘ssh: connect to host localhost port 22: Connection refused’, run the following command to install the SSH server:
sudo apt-get install ssh
And then restart the service:
sudo service ssh restart
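To confirm that passphraseless SSH is working, you can try a non-interactive login; this minimal check should print ok without asking for a password:

# BatchMode makes ssh fail instead of prompting if key-based login is not set up
ssh -o BatchMode=yes localhost 'echo ok'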
Configure the pseudo-distributed mode (Single-node mode)
Now, we can follow the official guide to configure a single node:
1) Setup environment variables (optional)
Setup environment variables by editing file ~/.bashrc.
vi ~/.bashrc
Add the following environment variables:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Run the following command to source the latest variables:
source ~/.bashrc
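You can quickly check that the variables took effect; a minimal sketch, assuming the exports above, is:

# Confirm HADOOP_HOME is set and the hadoop command is on PATH
echo $HADOOP_HOME
hadoop version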
2) Edit etc/hadoop/hadoop-env.sh file:
vi etc/hadoop/hadoop-env.sh
Set the JAVA_HOME environment variable:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
3) Edit etc/hadoop/core-site.xml:
vi etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Optional: you can also configure DFS locations:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/dfs/namespace_logs_330</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/dfs/data_330</value>
</property>
*Make sure the above folders exist and that the Hadoop service account has permission to write to and manage them.
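For example, a minimal sketch, assuming you run the Hadoop daemons under your current Linux account and use the paths above, is:

# Create the NameNode and DataNode directories and give the current user ownership
sudo mkdir -p /data/dfs/namespace_logs_330 /data/dfs/data_330
sudo chown -R $USER:$USER /data/dfs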
4) Edit etc/hadoop/hdfs-site.xml:
vi etc/hadoop/hdfs-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
5) Edit file etc/hadoop/mapred-site.xml:
vi etc/hadoop/mapred-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
6) Edit file etc/hadoop/yarn-site.xml:
vi etc/hadoop/yarn-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
Format namenode
Run the following command to format the name node:
bin/hdfs namenode -format
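If you configured dfs.namenode.name.dir earlier, you can optionally confirm that the format created the NameNode metadata; this sketch assumes the /data/dfs/namespace_logs_330 path used above:

# The format step should have created a 'current' directory with VERSION and fsimage files
ls /data/dfs/namespace_logs_330/current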
Run DFS daemons
1) Run the following commands to start NameNode and DataNode daemons:
sbin/start-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [raymond-pc]
2) Check status via jps command:
jps

2212 NameNode
2423 DataNode
2682 SecondaryNameNode
2829 Jps
If the services started successfully, you should see these four processes.
3) View name node portal
You can view the NameNode web UI at the following URL (the default NameNode HTTP port in Hadoop 3.x is 9870):
http://localhost:9870
You can also view information about the data nodes through the Datanodes menu link in the web UI.
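At this point you can also run a quick HDFS smoke test from the Hadoop folder to confirm the file system is writable; this is a small sketch using the current user name:

# Create a home directory for the current user in HDFS and list the root
bin/hdfs dfs -mkdir -p /user/$USER
bin/hdfs dfs -ls /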
Run YARN daemon
1) Run the following command to start the YARN daemons (ResourceManager and NodeManager):
sbin/start-yarn.sh

WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
Starting resourcemanager
WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
Starting nodemanagers
WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
2) Check status via jps command
jps

2212 NameNode
5189 NodeManager
2423 DataNode
5560 Jps
5001 ResourceManager
2682 SecondaryNameNode
Once the services are started, you can see two more processes for NodeManager and ResourceManager.
3) View YARN web portal
You can view the YARN ResourceManager web UI at the following URL (the default ResourceManager HTTP port is 8088):
http://localhost:8088
You can view all the applications through this web portal.
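To verify that YARN can actually run jobs, you can submit the MapReduce example bundled with the distribution; a minimal sketch, run from the Hadoop folder, is:

# Estimate pi with 2 map tasks and 5 samples per map using the bundled examples jar
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 5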
Shutdown services
Once you've completed explorations, you can use the following command to shutdown those daemons:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
You can verify through the jps command, which will now show only one process:
jps

6593 Jps
Summary
Congratulations! You have now successfully installed a single-node Hadoop 3.3.0 cluster on your Linux system.
Have fun with Hadoop 3.3.0.