Install Hadoop 3.2.0 on Windows 10 using Windows Subsystem for Linux (WSL)
- Prerequisites
- Install Hadoop 3.2.0 in WSL
- Install Java JDK
- Download Hadoop binary
- Unzip Hadoop binary
- Configure passphraseless ssh
- Configure the pseudo-distributed mode (Single-node mode)
- Format namenode
- Run DFS daemons
- Run YARN daemon
- Unhealthy nodes
- org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException java.net.SocketException: Permission denied
- Environment variables
- Summary
In my previous post, I showed how to configure a single-node Hadoop instance on Windows 10. The steps are not too difficult to follow if you have a Java programming background. However, one step is not very straightforward: the native Hadoop executable (winutils.exe) is not included in the official Hadoop distribution and needs to be downloaded separately or built locally. On Linux or UNIX, you don’t usually need to do that since the native libraries are pre-compiled and included in the binary distribution.
In August 2016, Microsoft published the initial release of Windows Subsystem for Linux (WSL). In June this year, WSL 2.0 will also be released with enhanced performance. With WSL, we can run Linux as a subsystem in Windows 10. In this post, I am going to show you how to install Hadoop 3.2.0 in WSL.
If you prefer to install the latest Hadoop 3.2.1 on Windows using native Windows HDFS, please follow this article:
Latest Hadoop 3.2.1 Installation on Windows 10 Step by Step Guide
If you'd like to build Hadoop 3.2.1 on Windows 10, please follow this article:
Compile and Build Hadoop 3.2.1 on Windows 10 Guide
Prerequisites
Follow the page below to enable WSL and then install one of the Linux systems from Microsoft Store.
Windows Subsystem for Linux Installation Guide for Windows 10
To be specific, enable WSL by running the following PowerShell command as Administrator (or enable it through Control Panel):
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
And then install Ubuntu from Microsoft Store.
Once the download is completed, click the Launch button to launch the application. It may take a few minutes to install:
During the installation, you need to input a username and password. Once it is done, you are ready to use the Ubuntu terminal:
Install Hadoop 3.2.0 in WSL
Install Java JDK
Run the following command to update package index:
sudo apt update
Check whether Java is installed already:
java -version
Command 'java' not found, but can be installed with:
sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless
Install OpenJDK via the following command:
sudo apt-get install openjdk-8-jdk
Check the version installed:
java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
*Java 11 is not supported yet by Hadoop as at 2019-05-11.
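Because this guide relies on Java 8, it can help to check the reported version before going further. The following is only a minimal sketch: the function name java_major_version is hypothetical, and it assumes the version string follows either the older "1.8.0_191" style or the newer "11.0.2" style.

```shell
# Hypothetical helper (not part of Hadoop or the JDK): print the major Java
# version for a version string such as "1.8.0_191" or "11.0.2".
java_major_version() {
    case "$1" in
        1.*) printf '%s\n' "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.x means Java 8
        *)   printf '%s\n' "$1" | cut -d. -f1 ;;   # newer scheme: 11.x means Java 11
    esac
}

# `java -version` prints to stderr; the quoted string on the line that
# contains "version" is the version number.
if command -v java >/dev/null 2>&1; then
    version=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
    if [ "$(java_major_version "$version")" = "8" ]; then
        echo "Java 8 detected: $version"
    else
        echo "Java $version detected; this guide assumes Java 8"
    fi
fi
```

If a different version shows up after installing openjdk-8-jdk, another JDK is probably first on the PATH.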
Download Hadoop binary
Go to the release page of the Hadoop website to find a download URL for Hadoop 3.2.0:
For me, the closest mirror is:
http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
Run the following command in Ubuntu terminal to download a binary from the internet:
wget http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
Unzip Hadoop binary
Run the following command to create a hadoop folder under home folder:
mkdir ~/hadoop
And then run the following command to unzip the binary package:
tar -xvzf hadoop-3.2.0.tar.gz -C ~/hadoop
Once it is unzipped, change the current directory to the hadoop folder:
cd ~/hadoop/hadoop-3.2.0/
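An interrupted download or a partial extraction can leave folders missing, which only shows up later as confusing errors. As a rough sanity check, the sketch below tests for the sub-folders that the rest of this guide uses. The function name check_hadoop_layout is hypothetical and the folder list is my own choice, not an official requirement.

```shell
# Hypothetical helper: succeed only if the folders used later in this guide
# (bin, etc/hadoop, sbin, share) exist under the given Hadoop folder.
check_hadoop_layout() {
    for sub in bin etc/hadoop sbin share; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub"
            return 1
        fi
    done
    echo "layout looks complete: $1"
}

# Example call, using the folder created in the steps above.
check_hadoop_layout "$HOME/hadoop/hadoop-3.2.0" || echo "extraction may be incomplete"
```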
Configure passphraseless ssh
This step is critical, so please make sure you follow it carefully.
Make sure you can SSH to localhost in Ubuntu:
ssh localhost
If you cannot ssh to localhost without a passphrase, run the following commands to initialize your private and public keys:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
If you encounter errors like ‘ssh: connect to host localhost port 22: Connection refused’, run the following command:
sudo apt-get install ssh
And then restart the service:
sudo service ssh restart
If the above commands still don’t work, try the solution in this comment.
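Re-running the key setup appends the same public key to authorized_keys again each time. That is harmless but untidy. The sketch below shows one way to make the append step repeatable. The function name authorize_key is hypothetical, and it assumes the public key file contains a single line.

```shell
# Hypothetical helper: append a public key to an authorized_keys file only
# if that exact line is not already present, then restrict the file mode.
authorize_key() {
    pub_file=$1
    auth_file=$2
    touch "$auth_file"
    # -x matches whole lines, -F treats the key as a fixed string,
    # -f reads the pattern (the key line) from the public key file.
    if ! grep -qxF -f "$pub_file" "$auth_file"; then
        cat "$pub_file" >> "$auth_file"
    fi
    chmod 0600 "$auth_file"
}
```

For the key generated above, the call would be authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys. To confirm that no prompt is needed afterwards, ssh -o BatchMode=yes localhost true should return without asking for anything, since BatchMode disables password and passphrase prompts.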
Configure the pseudo-distributed mode (Single-node mode)
Now, we can follow the official guide to configure a single node:
The steps are very similar to the ones in my previous post.
Edit etc/hadoop/hadoop-env.sh file:
vi etc/hadoop/hadoop-env.sh
Set a JAVA_HOME environment variable:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Edit etc/hadoop/core-site.xml:
vi etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit etc/hadoop/hdfs-site.xml:
vi etc/hadoop/hdfs-site.xml
Add the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Edit file etc/hadoop/mapred-site.xml:
vi etc/hadoop/mapred-site.xml
Add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
Edit file etc/hadoop/yarn-site.xml:
vi etc/hadoop/yarn-site.xml
Add the following configuration:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
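If you would rather script these edits than make them in vi, the files can be generated from name/value pairs. This is only a sketch under a few assumptions: the function name write_site_xml is hypothetical, it overwrites the target file (including the license comment that ships in it), and it does not escape XML special characters, which none of the values above contain.

```shell
# Hypothetical helper: write a Hadoop *-site.xml file from name/value pairs.
# Usage: write_site_xml FILE NAME VALUE [NAME VALUE ...]
write_site_xml() {
    out=$1
    shift
    {
        printf '<configuration>\n'
        while [ "$#" -ge 2 ]; do
            printf '  <property>\n'
            printf '    <name>%s</name>\n' "$1"
            printf '    <value>%s</value>\n' "$2"
            printf '  </property>\n'
            shift 2
        done
        printf '</configuration>\n'
    } > "$out"
}

# Example: generate the two single-property files into a scratch folder.
# conf_dir is an assumption; point it at etc/hadoop to apply the files.
conf_dir=$(mktemp -d)
write_site_xml "$conf_dir/core-site.xml" fs.defaultFS hdfs://localhost:9000
write_site_xml "$conf_dir/hdfs-site.xml" dfs.replication 1
cat "$conf_dir/core-site.xml"
```

Writing to a scratch folder first makes it easy to compare the result with the existing files before copying anything over.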
Format namenode
Run the following command to format the name node:
bin/hdfs namenode -format
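Formatting is a one-time step, and formatting again discards the existing HDFS metadata. The sketch below checks whether a formatted name node folder already exists before you decide to run the command. It assumes that neither hadoop.tmp.dir nor dfs.namenode.name.dir has been overridden, so the default location under /tmp applies. The function name namenode_formatted is hypothetical.

```shell
# Hypothetical helper: succeed if the given NameNode folder already holds a
# formatted file system (the format step writes a current/VERSION file).
namenode_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumption: default settings, so the name node folder is
# /tmp/hadoop-<user>/dfs/name.
name_dir="/tmp/hadoop-$(id -un)/dfs/name"

if namenode_formatted "$name_dir"; then
    echo "NameNode already formatted in $name_dir; skipping format"
else
    echo "No formatted NameNode in $name_dir; run: bin/hdfs namenode -format"
fi
```

Since the default folder lives under /tmp, it is lost whenever /tmp is cleared, and the name node then has nothing to start from until it is formatted again.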
Run DFS daemons
Run the following command to start NameNode and DataNode daemons:
sbin/start-dfs.sh
tangr@Raymond-Alienware:~/hadoop/hadoop-3.2.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [Raymond-Alienware]
Raymond-Alienware: Warning: Permanently added 'raymond-alienware' (ECDSA) to the list of known hosts.
You can view the name node through the following URL:
http://localhost:9870/dfshealth.html#tab-overview
The web UI looks like the following:
Run YARN daemon
Run the following command to start YARN daemon:
sbin/start-yarn.sh
tangr@Raymond-Alienware:~/hadoop/hadoop-3.2.0$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
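The start scripts can finish without errors even when a daemon fails shortly afterwards, so it is worth confirming which processes are actually running. The jps tool that ships with the JDK lists running Java processes. The sketch below reads that list and prints the daemons that are missing. The function name missing_daemons and the list of five daemons are my own assumptions for a single-node setup like this one.

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of each
# expected Hadoop daemon that is not listed. Matching is done on the whole
# second column so that "SecondaryNameNode" does not count as "NameNode".
missing_daemons() {
    awk '
        { seen[$2] = 1 }
        END {
            n = split("NameNode DataNode SecondaryNameNode ResourceManager NodeManager", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }
    '
}

# Example: check the live processes when jps is available.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

No output means all five daemons are up. If a name is printed, the matching log file in the logs folder under the Hadoop folder is the first place to look.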
Once the services are started, you can view the YARN resource manager web UI through the following URL:
The web UI looks like the following:
Unhealthy nodes
I am currently running the WSL Ubuntu terminal on my C drive, which is almost full (available capacity is lower than 10%), so the single node did not start successfully.
For more details, refer to my post: Hadoop on Windows - UNHEALTHY Data Nodes Fix.
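To see whether you are likely to hit the same problem, you can check how full the drive is from inside WSL. This is a rough sketch: the function name disk_used_percent is hypothetical, and the threshold of 90 is an assumption taken from the 10% free space figure mentioned above.

```shell
# Hypothetical helper: print the used-space percentage (without the % sign)
# of the file system that holds the given path, using POSIX `df -P` output.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Assumption: 90 is used as the threshold because this post reports trouble
# once less than 10% of the drive is free; adjust it to your own setup.
threshold=90
used=$(disk_used_percent /)
if [ "$used" -ge "$threshold" ]; then
    echo "Disk is ${used}% full; the node may be reported as unhealthy"
else
    echo "Disk is ${used}% full"
fi
```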
You can also install WSL Ubuntu on another drive (instead of the C drive).
Refer to the official guide to learn how to manually install WSL in a non-system drive:
org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException java.net.SocketException: Permission denied
You may encounter this issue:
INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.SocketException: Permission denied
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
The name node service cannot start because the socket bind fails. As we are not using privileged ports in the core-site configuration, I have not found the root cause yet. However, the issue was resolved after I restarted my Windows computer.
If restart doesn't help, try this approach: java.net.SocketException: Permission denied.
Environment variables
To make it easier to run Hadoop commands, add the following environment variables into .bashrc file in your home folder:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME=/home/tangr/hadoop/hadoop-3.2.0
export PATH=$PATH:$HADOOP_HOME/bin
*Remember to replace the user name in the HADOOP_HOME path (tangr in this example) with your own user name in the Linux system.
Summary
Congratulations! You have now installed a single-node Hadoop 3.2.0 cluster in your Ubuntu subsystem on Windows 10. It’s relatively easy, as we don’t need to download or build the native Hadoop libraries.
BTW, the subsystem is not a virtual machine; however, it provides almost the same experience as you would have in a native Linux system.
Have fun!
It seems that your config XML file encoding is not correct or file content is not complete.
Can you please make sure the binary package is downloaded successfully and also all the content is extracted properly?
You should be able to see the following content in Hadoop folder:
~/hadoop/hadoop-3.2.0$ ls
LICENSE.txt NOTICE.txt README.txt bin etc include lib libexec logs sbin share
BTW, from the screenshot, I can see your Hadoop version is 3.2.2 instead of 3.2.0. Technically that should not be a problem, but I have not tested Hadoop 3.2.2 on WSL.
After restarting my Windows computer, I still have the problem:
org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException java.net.SocketException: Permission denied
how to we execute "hdfs fsck" command, it's giving regarding the file system commands
Hi,
Can you add more details about the question? I am not sure whether I understand correctly. If you want to execute that command, you can directly run it in bash/Terminal.
https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#fsck
You mention "native Hadoop executable (winutils.exe) is not included in the official Hadoop distribution and needs to be downloaded separately [or built locally]." Do you happen to know where this can be downloaded for 3.2.0?
Hi, please try this repository https://github.com/steveloughran/winutils
Unfortunately, this is only updated to Hadoop 3.0.0.
I'm not sure whether it works for 3.2.0. You can give it a try. Please pay attention to the README.md file about the author's declaration.
Thanks. I also got the issue when I was writing the post. However I was able to resolve it by running the following two commands only:
sudo apt-get install ssh
sudo service ssh restart
Just in case other people cannot resolve the problem, I've updated the post to include the link to your comment so that they can follow that guide to resolve the connection issue.
ssh localhost
Connection closed by 127.0.0.1 port 22
http://localhost:9870/dfshealth.html#tab-overview not working
Have you tried the steps I mentioned in the post?
sudo apt-get install ssh
sudo service ssh restart
I'm not an expert in networking, and I'm not sure the following solution will help, as this is all local traffic. I cannot reproduce this issue in my environment, so it is hard to say where it goes wrong in yours.
There must be some other reasons that you cannot ssh localhost. For example, is port 22 used by your other programs?
Can you also please try the same approach to allow ssh connections?
The websites won't start successfully until you resolve the ssh issue. So make sure you can ssh localhost first.
- Protocol type: TCP
- Local port: 22
- Remote port: All Ports
- Scope: make sure all your local IP addresses are added.
- Profiles: Private. I'm choosing this one as I will only connect to my WSL when connected to a private network.
I installed ssh and restarted it. Now 'ssh localhost' just says 'Connection closed by ::1 port 22.'
Have you tried the solution I mentioned in the post? I got the same issue when it was first installed, but after the following commands, it worked. Also make sure you stop and restart the Hadoop daemons.
sudo apt-get install ssh
sudo service ssh restart
I'm not an expert in networking, and I'm not sure the following solution will help, as this is all local traffic. There must be some other reason that you cannot ssh to localhost. For example, is port 22 used by another program? Can you also use the IPv4 address for localhost instead of the IPv6 one?
Can you try to add firewall rule to allow TCP traffic to ssh port 22?
- Protocol type: TCP
- Local port: 22
- Remote port: All Ports
- Scope: make sure all your local IP addresses are added.
- Profiles: Private. I'm choosing this one as I will only connect to my WSL when connected to a private network.
I get Permission Denied when trying to get hadoop binary. after research I found that I need to use sudo in front of command. So need to use
sudo wget http://mirrors.....
Thanks for great article!
In my case, the command:
sbin/start-dfs.sh
is executed without errors, but the NameNode is not started and therefore it is not responding on http://localhost:9870.
Executing the jps command, I can see that the running processes are:
1) SecondaryNameNode
2) DataNode
3) Jps
NameNode process is missing from the returned list.
Any idea on what can be going wrong?
I followed all the instructions in this guide to configure my WSL environment.
Thanks
I am getting the error shown in above image when running the above command.
I have successfully ssh to localhost.
And the other problem is I can successfully enter into hadoop directory by typing cd hadoop but when i try to do "ls" i can not see the hadoop directory in the list.