Install Hadoop 3.3.1 on Windows 10 Step by Step Guide

This detailed step-by-step guide shows you how to install the latest Hadoop (v3.3.1) on Windows 10. It uses the Hadoop 3.3.1 winutils native libraries; WSL (Windows Subsystem for Linux) is not required. Hadoop 3.3.1 was released on June 15, 2021 and is the second release of the Apache Hadoop 3.3 line.

Please follow all the instructions carefully. Once you complete the steps, you will have a working pseudo-distributed, single-node Hadoop installation.

References

Refer to the following articles if you prefer to install other versions of Hadoop, or if you want to configure a multi-node cluster or use WSL.

Required tools

Before you start, make sure you have the following tools available on Windows 10.

PowerShell

We will use this tool to download the package.

In my system, the PowerShell version table is listed below:

$PSVersionTable

Name                       Value
----                       -----
PSVersion                  5.1.19041.1237
PSEdition                  Desktop
PSCompatibleVersions       {1.0, 2.0, 3.0, 4.0...}
BuildVersion               10.0.19041.1237
CLRVersion                 4.0.30319.42000
WSManStackVersion          3.0
PSRemotingProtocolVersion  2.3
SerializationVersion       1.1.0.1
Git Bash or 7 Zip

We will use Git Bash or 7-Zip to unzip the Hadoop binary package.

You can choose to install either tool or any other tool as long as it can unzip *.tar.gz files on Windows.

Command Prompt

We will use it to start Hadoop daemons and run some commands as part of the installation process.
Java JDK

JDK is required to run Hadoop as the framework is built using Java.

In my system, my JDK version is jdk1.8.0_161.

Check out the supported JDK versions on the following page:

https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions

From Hadoop 3.3, the Java 11 runtime is also supported.

Now we will start the installation process. 

Step 1 - Download Hadoop binary package

Select download mirror link

Go to download page of the official website:

Apache Download Mirrors - Hadoop 3.3.1

Then choose one of the mirror links. The page lists the mirrors closest to you based on your location. I am choosing the following mirror link:

https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

info In the following sections, this URL will be used to download the package.  Your URL might be different from mine and you can replace the link accordingly.

Download the package

info In this guide, I am installing Hadoop in the folder big-data on my F drive (F:\big-data). If you prefer to install on another drive, please remember to change the path accordingly in the following command lines. This directory is also called the destination directory in the following sections.

Open PowerShell and then run the following command lines one by one:

$dest_dir="F:\big-data"
$url = "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz"
# Create the destination directory first if it does not exist yet.
New-Item -ItemType Directory -Force -Path $dest_dir | Out-Null
$client = new-object System.Net.WebClient
$client.DownloadFile($url,$dest_dir+"\hadoop-3.3.1.tar.gz")

It may take a few minutes to download. 
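Alternatively, Invoke-WebRequest (available since PowerShell 3.0) can download the file. This is just an optional sketch using the same $url and $dest_dir variables defined above:

# Optional alternative download; suppressing the progress bar speeds up large downloads
# noticeably in Windows PowerShell 5.1.
$ProgressPreference = "SilentlyContinue"
Invoke-WebRequest -Uri $url -OutFile ($dest_dir + "\hadoop-3.3.1.tar.gz")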

Once the download completes, you can verify it:

PS F:\big-data> cd $dest_dir
PS F:\big-data> ls


    Directory: F:\big-data


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----        12/10/2021   7:28 PM      605187279 hadoop-3.3.1.tar.gz

You can also directly download the package through your web browser and save it to the destination directory.
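Optionally, you can verify the integrity of the download. The sketch below assumes a .sha512 checksum file is published next to the package at the same URL (adjust the link if your mirror differs); compare the two printed values manually:

# Download the published SHA-512 checksum next to the package.
$sha_url = "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512"
(New-Object System.Net.WebClient).DownloadFile($sha_url, $dest_dir + "\hadoop-3.3.1.tar.gz.sha512")
# Print the locally computed hash and the published hash so they can be compared.
(Get-FileHash "$dest_dir\hadoop-3.3.1.tar.gz" -Algorithm SHA512).Hash
Get-Content "$dest_dir\hadoop-3.3.1.tar.gz.sha512"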

warning Please keep this PowerShell window open as we will use some variables from this session in the following steps. If you have already closed it, that is okay; just remember to reinitialize the above variables ($client, $dest_dir), for example as shown below.
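For example, in a new PowerShell window you would re-run:

# Re-create the session variables used in the remaining steps.
$dest_dir = "F:\big-data"
$client = New-Object System.Net.WebClient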

Step 2 - Unpack the package

Now we need to unpack the downloaded package using a GUI tool (like 7-Zip) or the command line. I will use Git Bash to unpack it.

Open Git Bash and change to the destination folder. Note that Git Bash uses Unix-style paths, so F:\big-data becomes /f/big-data:

cd /f/big-data

And then run the following command to unzip:

tar -xvzf  hadoop-3.3.1.tar.gz

The command will take quite a few minutes to complete as the package contains numerous files.

After the unzip command is completed, a new folder hadoop-3.3.1 is created under the destination folder. 


info When running the command, you may see errors like the following:
tar: Exiting with failure status due to previous errors
Please ignore them for now.
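info Recent Windows 10 builds (1803 and later) also ship a built-in bsdtar-based tar.exe, so the same extraction can be done directly from PowerShell or Command Prompt without Git Bash. An optional sketch:

# Extract the archive using the tar.exe bundled with Windows 10.
cd F:\big-data
tar -xvzf hadoop-3.3.1.tar.gz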

Step 3 - Install Hadoop native IO binary

On Linux, Hadoop's native IO support is optional. On Windows, however, native IO is mandatory, and without it you will not be able to get your installation working. The Windows native IO libraries are not included in the Apache Hadoop release, so we need to build or download them separately.

info The following repository provides pre-built Hadoop Windows native libraries:
https://github.com/kontext-tech/winutils
warning These libraries are not signed and there is no guarantee that they are 100% safe. We use them purely for test and learn purposes.

Download all the files from the following location and save them to the bin folder under your Hadoop folder. For my environment, the full path is F:\big-data\hadoop-3.3.1\bin. Remember to change it to your own path accordingly. One way to script the download is shown after the link.

https://github.com/kontext-tech/winutils/tree/master/hadoop-3.3.1/bin
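The following PowerShell sketch fetches the two most critical files. The raw GitHub URLs are assumptions derived from the repository link above, and the repository's bin folder contains additional files that should be copied as well:

# Download winutils.exe and hadoop.dll, the native files Hadoop needs most on Windows.
$bin_dir = "F:\big-data\hadoop-3.3.1\bin"
$base_url = "https://raw.githubusercontent.com/kontext-tech/winutils/master/hadoop-3.3.1/bin"
foreach ($f in "winutils.exe", "hadoop.dll") {
    (New-Object System.Net.WebClient).DownloadFile("$base_url/$f", "$bin_dir\$f")
}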

After this, the bin folder contains winutils.exe, hadoop.dll and the other native libraries.

Step 4 - (Optional) Java JDK installation

Java JDK is required to run Hadoop. If you have not installed Java JDK, please install it.

You can install JDK 8 from the following page:

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Once you complete the installation, please run the following command in PowerShell or Git Bash to verify:

$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

If you get an error like 'cannot find java command or executable', don't worry; we will resolve this in the following step.

Step 5 - Configure environment variables

Now that we've downloaded and unpacked all the artefacts, we need to configure two important environment variables.

Configure JAVA_HOME environment variable

As mentioned earlier, Hadoop requires Java, so we need to configure the JAVA_HOME environment variable (it is not strictly mandatory, but I recommend it).

First, we need to find out the location of the Java JDK. In my system, the path is D:\Java\jdk1.8.0_161.

Your location may be different depending on where you installed your JDK.
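If you are not sure where the JDK is installed, one quick way to check is to list the default Oracle install location. The path below is only an assumption; adjust it for your machine:

# Lists JDK folders under the default Oracle install directory, if it exists.
Get-ChildItem "C:\Program Files\Java" -ErrorAction SilentlyContinue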

And then run the following command in the previous PowerShell window:

SETX JAVA_HOME "D:\Java\jdk1.8.0_161" 
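The second variable is typically HADOOP_HOME, which should point to the folder where Hadoop was unpacked. A minimal sketch, assuming the F:\big-data\hadoop-3.3.1 path used in this guide (the bin folders of both the JDK and Hadoop also need to be on PATH):

# Point HADOOP_HOME at the unpacked Hadoop folder (adjust the path for your system).
SETX HADOOP_HOME "F:\big-data\hadoop-3.3.1"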