Install Hadoop 3.2.0 on Windows 10 using Windows Subsystem for Linux (WSL)


In my previous post, I showed how to configure a single-node Hadoop instance on Windows 10. The steps are not too difficult to follow if you have a Java programming background. However, one step is not very straightforward: the native Hadoop executable (winutils.exe) is not included in the official Hadoop distribution and needs to be downloaded separately or built locally. On Linux or UNIX, you don’t usually need to do that since the native libraries are pre-compiled and included in the binary distribution.

In August 2016, Microsoft published the initial release of Windows Subsystem for Linux (WSL). In June this year, WSL 2.0 will also be released with enhanced performance. With WSL, we can run Linux as a subsystem in Windows 10. In this post, I am going to show you how to install Hadoop 3.2.0 in WSL.

Hadoop 3.2.1
If you prefer to install the latest Hadoop 3.2.1 on Windows using native Windows HDFS, please follow this article:

Latest Hadoop 3.2.1 Installation on Windows 10 Step by Step Guide

If you'd like to build Hadoop 3.2.1 on Windows 10, please follow this article:

Compile and Build Hadoop 3.2.1 on Windows 10 Guide

Prerequisites

Follow the page below to enable WSL and then install one of the Linux systems from Microsoft Store.

Windows Subsystem for Linux Installation Guide for Windows 10

To be specific, enable WSL by running the following PowerShell command as Administrator (or enable it through Control Panel):

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

Then install Ubuntu from the Microsoft Store.

image

image

Once the download is complete, click the Launch button to launch the application. It may take a few minutes to install:

image

During the installation, you need to input a username and password. Once it is done, you are ready to use the Ubuntu terminal:

image
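If you want to double-check that the terminal really runs inside WSL, one informal way is to look for the word Microsoft in /proc/version, which WSL kernels report as far as I know. The helper name is_wsl_banner is my own and not part of any tool, so treat this as a minimal sketch:

```shell
# Hypothetical helper: succeeds when a kernel banner mentions Microsoft,
# which is what WSL is assumed to report in /proc/version.
is_wsl_banner() {
  printf '%s\n' "$1" | grep -qi 'microsoft'
}

# Usage inside the Ubuntu terminal:
if is_wsl_banner "$(cat /proc/version 2>/dev/null)"; then
  echo "Running inside WSL"
else
  echo "Not running inside WSL (or /proc/version is unavailable)"
fi
```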

Install Hadoop 3.2.0 in WSL

Install Java JDK

Run the following command to update the package index:

sudo apt update

Check whether Java is installed already:

java -version

Command 'java' not found, but can be installed with:

sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless

Install OpenJDK via the following command:

sudo apt-get install openjdk-8-jdk

Check the version installed:

java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

*Java 11 is not yet supported by Hadoop as of 2019-05-11.
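Since the Java version matters here, the check can also be scripted. The sketch below extracts the major version from the first line of the java -version output; the function name java_major_from_banner is hypothetical, and it assumes the banner has the quoted version format shown above.

```shell
# Hypothetical helper: print the Java major version from a banner line such as
#   openjdk version "1.8.0_191"
# Versions before Java 9 are written as 1.x, so the second component is used.
java_major_from_banner() {
  # Pull the text between the first pair of double quotes, e.g. 1.8.0_191.
  version=$(printf '%s\n' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*$/\1/p')
  case "$version" in
    1.*) version=${version#1.} ;;
  esac
  printf '%s\n' "${version%%.*}"
}

# Usage: java -version writes to stderr, so redirect it first.
major=$(java_major_from_banner "$(java -version 2>&1 | head -n 1)")
if [ "$major" = "8" ]; then
  echo "Java 8 detected"
else
  echo "Java 8 not detected (found: ${major:-none})"
fi
```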

Download Hadoop binary

Go to the release page of the Hadoop website to find a download URL for Hadoop 3.2.0:

Hadoop Releases

For me, the closest mirror is:

http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

Run the following command in the Ubuntu terminal to download the binary package:

wget http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
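Because the package comes from a mirror, it is reasonable to compare its checksum with the one published on the official Apache site before unpacking it. The sketch below assumes that a SHA-512 digest is available for the release and that you copy it by hand; verify_sha512 is a name I made up for illustration.

```shell
# Hypothetical helper: succeeds when the SHA-512 digest of a file matches
# the expected hex digest (compared case-insensitively).
verify_sha512() {
  file="$1"
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  actual=$(sha512sum "$file" | awk '{ print $1 }')
  [ "$actual" = "$expected" ]
}

# Usage (the digest is a placeholder, copy the real one from the Apache site):
# verify_sha512 hadoop-3.2.0.tar.gz "<published digest>" && echo "checksum OK"
```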

Unzip Hadoop binary

Run the following command to create a hadoop folder under your home folder:

mkdir ~/hadoop

And then run the following command to unzip the binary package:

tar -xvzf hadoop-3.2.0.tar.gz -C ~/hadoop

Once it is unzipped, change the current directory to the hadoop folder:

cd ~/hadoop/hadoop-3.2.0/
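To confirm that the package was unpacked where the later steps expect it, a quick check like the following can help. It is only a sketch: looks_like_hadoop_home is a hypothetical name, and the test simply looks for the bin/hdfs script and the etc/hadoop folder used later in this post.

```shell
# Hypothetical helper: succeeds when a directory contains the pieces
# used later in this guide (bin/hdfs and etc/hadoop).
looks_like_hadoop_home() {
  [ -x "$1/bin/hdfs" ] && [ -d "$1/etc/hadoop" ]
}

# Usage:
if looks_like_hadoop_home "$HOME/hadoop/hadoop-3.2.0"; then
  echo "Hadoop folder looks complete"
else
  echo "Hadoop folder not found or incomplete"
fi
```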

Configure passphraseless ssh

This step is critical, so please make sure you follow it carefully.

Make sure you can SSH to localhost in Ubuntu:

ssh localhost

If you cannot SSH to localhost without a passphrase, run the following commands to initialize your private and public keys:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
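Note that running the cat command above more than once appends the same key again. If you expect to repeat the setup, a small guard avoids the duplicates. This is a sketch with a hypothetical function name (add_key_once), not something SSH provides:

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if that exact line is not already present, then restrict permissions.
add_key_once() {
  pub_key_file="$1"
  auth_file="$2"
  touch "$auth_file"
  # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
  if ! grep -qxFf "$pub_key_file" "$auth_file"; then
    cat "$pub_key_file" >> "$auth_file"
  fi
  chmod 0600 "$auth_file"
}

# Usage:
# add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```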

If you encounter errors like ‘ssh: connect to host localhost port 22: Connection refused’, run the following commands:

sudo apt-get install ssh

And then restart the service:

sudo service ssh restart

If the above commands still don’t work, try the solution in this comment.

Configure the pseudo-distributed mode (Single-node mode)

Now, we can follow the official guide to configure a single node:

Pseudo-Distributed Operation

The steps are very similar to the ones in my previous post.

Edit etc/hadoop/hadoop-env.sh file:

vi etc/hadoop/hadoop-env.sh

Set a JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
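If your JDK lives in a different folder, the path can be derived from the java binary instead of being typed by hand. The sketch below assumes the usual layout where the resolved binary ends in /bin/java or /jre/bin/java; java_home_from_bin is a hypothetical helper name.

```shell
# Hypothetical helper: turn the resolved path of the java binary into a
# JAVA_HOME candidate by stripping the trailing /bin/java and, if present, /jre.
java_home_from_bin() {
  path="${1%/bin/java}"
  path="${path%/jre}"
  printf '%s\n' "$path"
}

# Usage: resolve symlinks first, then strip the suffix.
java_home_from_bin "$(readlink -f "$(command -v java)" 2>/dev/null)"
```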

Edit etc/hadoop/core-site.xml:

vi etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
     <property>
         <name>fs.defaultFS</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

Edit etc/hadoop/hdfs-site.xml:

vi etc/hadoop/hdfs-site.xml

Add the following configuration:

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>
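If you would rather not edit these files interactively, the two single-property files above (core-site.xml and hdfs-site.xml) can also be written from the shell. This is only a sketch: write_single_property_site is a name I invented, it overwrites the target file, and it does no XML escaping, so it suits simple values like the ones in this post.

```shell
# Hypothetical helper: overwrite a Hadoop *-site.xml file with a
# configuration that contains exactly one property.
write_single_property_site() {
  target="$1"
  name="$2"
  value="$3"
  cat > "$target" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>$name</name>
        <value>$value</value>
    </property>
</configuration>
EOF
}

# Usage from the Hadoop folder (commented out because it overwrites files):
# write_single_property_site etc/hadoop/core-site.xml fs.defaultFS hdfs://localhost:9000
# write_single_property_site etc/hadoop/hdfs-site.xml dfs.replication 1
```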

Edit etc/hadoop/mapred-site.xml:

vi etc/hadoop/mapred-site.xml

Add the following configuration:

<configuration>
     <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
     </property>
     <property>
         <name>mapreduce.application.classpath</name>
         <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
     </property>
</configuration>

Edit etc/hadoop/yarn-site.xml:

vi etc/hadoop/yarn-site.xml

Add the following configuration:

<configuration>
     <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
     </property>
     <property>
         <name>yarn.nodemanager.env-whitelist</name>
         <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
     </property>
</configuration>

Format namenode

Run the following command to format the name node:

bin/hdfs namenode -format
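Formatting is meant to be done once; formatting again discards the existing HDFS metadata. A simple guard is to check whether the name directory already contains a current/VERSION file. The sketch below assumes the default location /tmp/hadoop-<user>/dfs/name, which applies when neither hadoop.tmp.dir nor dfs.namenode.name.dir is set (as in this post). The function name is hypothetical.

```shell
# Hypothetical helper: succeeds when the given name directory already
# holds a formatted file system image (current/VERSION exists).
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Usage, assuming the default name directory:
name_dir="/tmp/hadoop-$(id -un)/dfs/name"
if namenode_formatted "$name_dir"; then
  echo "Name node already formatted at $name_dir, skipping format"
else
  echo "No existing format found at $name_dir"
fi
```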

Run DFS daemons

Run the following command to start the NameNode and DataNode daemons:

sbin/start-dfs.sh
tangr@Raymond-Alienware:~/hadoop/hadoop-3.2.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [Raymond-Alienware]
Raymond-Alienware: Warning: Permanently added 'raymond-alienware' (ECDSA) to the list of known hosts.

You can view the name node through the following URL:

http://localhost:9870/dfshealth.html#tab-overview

The web UI looks like the following:

image

Run YARN daemon

Run the following command to start the YARN daemons:

sbin/start-yarn.sh
tangr@Raymond-Alienware:~/hadoop/hadoop-3.2.0$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
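Besides the web UIs, the jps tool from the JDK lists the running Java processes, which gives a quick way to see whether every daemon came up. The sketch below reports the daemons that are missing from a jps listing; missing_daemons is a hypothetical name, and the daemon names are the ones I would expect for this single-node setup.

```shell
# Hypothetical helper: given jps output and a list of expected process names,
# print each expected name that does not appear as an exact match.
missing_daemons() {
  jps_output="$1"
  shift
  for name in "$@"; do
    # Compare against the second column so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage after running start-dfs.sh and start-yarn.sh:
if command -v jps >/dev/null 2>&1; then
  missing_daemons "$(jps)" NameNode DataNode SecondaryNameNode ResourceManager NodeManager
fi
```

An empty result means every listed daemon was found.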

Once the services are started, you can view the YARN resource manager web UI through the following URL:

http://localhost:8088/cluster

The web UI looks like the following:

image

Unhealthy nodes

I am currently running the WSL Ubuntu terminal on my C drive, which is almost full (available capacity is lower than 10%), so the single node did not start successfully.

For more details, refer to my post: Hadoop on Windows - UNHEALTHY Data Nodes Fix.
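To see whether you are likely to hit the same problem, you can check the disk usage before starting the daemons. The sketch below reads the usage percentage from df; the 90% threshold is my understanding of the default NodeManager disk health check limit, so treat it as an assumption, and the function name is hypothetical.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the Capacity
# column of the first file system row without the percent sign.
use_percent_from_df() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: warn when the drive that holds the home folder is nearly full.
used=$(df -P "$HOME" 2>/dev/null | use_percent_from_df)
if [ -n "$used" ] && [ "$used" -ge 90 ] 2>/dev/null; then
  echo "Disk is ${used}% full; the node may be reported as unhealthy"
else
  echo "Disk usage: ${used:-unknown}%"
fi
```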

You can also install WSL Ubuntu on another drive (instead of the C drive).

Refer to the official guide to learn how to manually install WSL in a non-system drive:

 Install Windows Subsystem for Linux on a Non-System Drive

org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException java.net.SocketException: Permission denied

You may encounter this issue:

INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.SocketException: Permission denied
     at sun.nio.ch.Net.bind0(Native Method)
     at sun.nio.ch.Net.bind(Net.java:433)
     at sun.nio.ch.Net.bind(Net.java:425)
     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)

The name node service cannot start because the socket bind fails. As we are not using privileged ports in the core-site configuration, I have not found the root cause yet. However, after I restarted my Windows computer, the issue was resolved.

Environment variables

To make it easier to run Hadoop commands, add the following environment variables to the .bashrc file in your home folder:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME=/home/tangr/hadoop/hadoop-3.2.0
export PATH=$PATH:$HADOOP_HOME/bin

*Remember to change the user name in HADOOP_HOME (tangr in this example) to your own user name in the Linux system.
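As an alternative, the same variables can be written with $HOME so that the user name does not need to be edited. This is a sketch that assumes Hadoop was unpacked to ~/hadoop/hadoop-3.2.0 as described earlier. Adding sbin to PATH is my own addition, so that start-dfs.sh and start-yarn.sh can be run from any folder.

```shell
# Same variables as above, but based on $HOME instead of a fixed user name.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME="$HOME/hadoop/hadoop-3.2.0"
# sbin is an optional addition (an assumption, not part of the original setup).
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After saving the file, run source ~/.bashrc or open a new terminal for the changes to take effect.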

Summary

Congratulations! You have now successfully installed a single-node Hadoop 3.2.0 cluster in your Ubuntu subsystem on Windows 10. It’s relatively easy as we don’t need to download or compile/build native Hadoop libraries.

BTW, the subsystem is not a virtual machine; however, it provides almost the same experience as a native Linux system.

Have fun!
