Load File into HDFS through WebHDFS APIs


To ingest data into HDFS, a commonly used approach is to upload files into a temporary folder on one of the edge servers of the Hadoop cluster, where HDFS CLIs are available to copy files from the local file system to the distributed file system.

In the past, I've published several related articles about WebHDFS.

This article expands further on WebHDFS. It uses the curl command to call these APIs; you can also implement these web requests in your preferred programming language, such as Python, Java, C#, or Go.

Hadoop environment

To test these APIs, I am using a single-node Hadoop cluster set up based on the following guide:

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide

For this environment, the host is localhost and the name node web port is 9870.

Create a directory in HDFS

Let's first create a directory to see how the web RESTful APIs work.

The syntax of calling this API is:

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]"

Replace those parameters according to your environment and then run the command in a terminal that supports the curl command.

In my environment, the following command is run:

curl -i -X PUT "http://localhost:9870/webhdfs/v1/test?op=MKDIRS&permission=755"

The command returns a permission error. Since I have not performed any authentication, the request is made as the default web user dr.who, which is then rejected:

Permission denied: user=dr.who, access=WRITE

To fix this issue, we can add the request user name to the URL:

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS&user.name=<USER>[&permission=<OCTAL>]"

Replace <USER> with your Hadoop user name; the command will run successfully if security is off.
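For example, in my environment the command becomes the following (with *** standing in for the actual Hadoop user name):

curl -i -X PUT "http://localhost:9870/webhdfs/v1/test?op=MKDIRS&user.name=***&permission=755"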
The following is a sample HTTP response:
HTTP/1.1 200 OK
Date: Sat, 22 Aug 2020 11:42:41 GMT
Cache-Control: no-cache
Expires: Sat, 22 Aug 2020 11:42:41 GMT
Date: Sat, 22 Aug 2020 11:42:41 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Set-Cookie: hadoop.auth="u=***&p=***&t=simple&e=1598132561905&s=1Y3XzIZ7zBNyzsWECIT27gae+IDcUvpvEhwaScJRE48="; Path=/; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked

As part of the response, the server also sets a client cookie (hadoop.auth) for Hadoop authentication.
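The response body itself (not shown above) is a small JSON object; for MKDIRS it simply indicates whether the directory was created, for example:

{"boolean": true}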

Hadoop security mode

In my Hadoop cluster, security mode is turned off, which is why I can directly call the API with a specified user. This is not secure for an enterprise environment, as anyone can perform any action and access all the data using this approach.

To enable Kerberos authentication, the following configurations need to be added to the Hadoop configuration files (hdfs-site.xml and core-site.xml), as sketched after the list below:

  • dfs.web.authentication.kerberos.principal: The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. The HTTP Kerberos principal MUST start with ‘HTTP/’ per Kerberos HTTP SPNEGO specification. A value of “*” will use all HTTP principals found in the keytab.
  • dfs.web.authentication.kerberos.keytab: The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
  • hadoop.http.authentication.type: Defines the authentication used for the HTTP web consoles. The supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#. You can customize the authentication handler, which provides an opportunity to use other middleware/tools to perform authentication, for example Apache Ranger.
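As a minimal sketch (the principal, realm and keytab path below are placeholder values, not taken from any particular environment), the corresponding entries might look like this:

<!-- hdfs-site.xml: HTTP Kerberos principal and keytab used by WebHDFS (placeholder values) -->
<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<!-- core-site.xml: switch the HTTP web consoles from simple to kerberos authentication -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>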

When security is enabled, all requests need to use a client that supports Kerberos HTTP SPNEGO authentication.

For the curl command line, refer to this page for more details about how to initialize the Kerberos configuration and use it for requests.
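For example, assuming curl was built with GSS-API/SPNEGO support and a Kerberos ticket has already been obtained for your user, a request might look like the following (the principal is just a placeholder):

kinit your_user@EXAMPLE.COM
curl -i --negotiate -u : "http://<HOST>:<PORT>/webhdfs/v1/test?op=MKDIRS"

With --negotiate and an empty -u :, curl performs the SPNEGO handshake using the ticket from the credential cache instead of passing a user name in the URL.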

For programming languages/frameworks, there are usually packages available to perform HTTP SPNEGO authentication. For example, the requests-gssapi package can be used in Python.

You can also use a delegation token when security is enabled. A delegation token can be retrieved using the following command:

curl -i "http://<HOST>:<PORT>/webhdfs/v1/?op=GETDELEGATIONTOKEN[&renewer=<USER>][&service=<SERVICE>][&kind=<KIND>]"
Note: a token will only be issued when security is on.
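The returned token can then be attached to subsequent requests through the delegation parameter, for example (token value omitted):

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/test?op=MKDIRS&delegation=<TOKEN>"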

Ingest file into HDFS

The CREATE operation can be used to upload a file into HDFS. Two steps are required:

1) Get the data node location

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
                    [&overwrite=<true |false>][&blocksize=<LONG>][&replication=<SHORT>]
                    [&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"

In my environment, I run the following command to find the data node location:

curl -i -X PUT "http://localhost:9870/webhdfs/v1/test/test.csv?op=CREATE&overwrite=true&noredirect=false"

Because noredirect is false, the name node replies with an HTTP 307 redirect whose Location header points to a data node (with noredirect=true, the location would instead be returned in a JSON response body).
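The response looks roughly like the following (the exact data node host, port and query parameters depend on the environment):

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://localhost:9864/webhdfs/v1/test/test.csv?op=CREATE&namenoderpcaddress=0.0.0.0:19000&createparent=true&overwrite=true
Content-Type: application/octet-stream
Content-Length: 0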


2) Ingest into the data node using the returned location

curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."

For this example, the command line is:

curl -i -X PUT -T test.csv "http://localhost:9864/webhdfs/v1/test/test.csv?op=CREATE&namenoderpcaddress=0.0.0.0:19000&createflag=&createparent=true&overwrite=true&user.name=***"

The output looks like the following:

HTTP/1.1 100 Continue

HTTP/1.1 201 Created
Location: hdfs://0.0.0.0:19000/test/test.csv
Content-Length: 0
Access-Control-Allow-Origin: *
Connection: close
Note: As with the previous command for creating a directory, I'm also including the user name parameter in the URL to avoid permission issues. In an enterprise environment, the data node host is usually different from the name node host; in my environment, both the data node and the name node run on localhost.
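A quick way to verify the upload is to query the file status through the name node, for example (GETFILESTATUS returns a JSON FileStatus object for the new file):

curl -i "http://localhost:9870/webhdfs/v1/test/test.csv?op=GETFILESTATUS"

You can also read the content back with op=OPEN.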

Reference

Refer to the official documentation about the WebHDFS REST API for a complete guide to the available APIs.
