Load File into HDFS through WebHDFS APIs
To ingest data into HDFS, one commonly used approach is to upload files into a temporary folder on one of the edge servers of the Hadoop cluster, where the HDFS CLI is available to copy files from the local file system to the distributed file system.
In the past, I've published several related articles about WebHDFS.
This article expands further on WebHDFS. It uses the curl command to call these APIs; you can also implement these web requests in your preferred programming language, such as Python, Java, C#, or Go.
Hadoop environment
To test these APIs, I am using a Hadoop single-node cluster set up based on the following guide:
Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
For this environment, the host is localhost and the name node port is 9870.
Create a directory in HDFS
Let's first create a directory to see how these RESTful web APIs work.
The syntax of calling this API is:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]"
Replace the parameters based on your environment and then run the command in a terminal that supports the curl command.
In my environment, I run the following command:
curl -i -X PUT "http://localhost:9870/webhdfs/v1/test?op=MKDIRS&permission=755"
The command returned a permission error.
Because I have not performed any authentication, the default web user dr.who is used, and the request is rejected:
Permission denied: user=dr.who, access=WRITE
To fix this issue, we can add the request user name to the URL:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS&user.name=<USER>[&permission=<OCTAL>]"
With the user name added, the request succeeds:
HTTP/1.1 200 OK
Date: Sat, 22 Aug 2020 11:42:41 GMT
Cache-Control: no-cache
Expires: Sat, 22 Aug 2020 11:42:41 GMT
Date: Sat, 22 Aug 2020 11:42:41 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Set-Cookie: hadoop.auth="u=***&p=***&t=simple&e=1598132561905&s=1Y3XzIZ7zBNyzsWECIT27gae+IDcUvpvEhwaScJRE48="; Path=/; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
As part of the response, it also sets a client cookie (hadoop.auth) for Hadoop authentication.
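The same MKDIRS call can also be made programmatically. Below is a minimal sketch using the Python requests package (an assumed dependency, not part of the original walkthrough); the host, path, and user name are placeholders for your environment:

# Minimal sketch: call WebHDFS MKDIRS with user.name via Python requests.
# Assumptions: requests is installed and the cluster runs without Kerberos.
import requests

resp = requests.put(
    "http://localhost:9870/webhdfs/v1/test",
    params={"op": "MKDIRS", "permission": "755", "user.name": "your_user"},
)
print(resp.status_code)  # 200 on success
print(resp.json())       # {"boolean": true} when the directory is created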
Hadoop security mode
In my Hadoop cluster, security mode is turned off, which is why I can directly call the API with a specified user. This is not secure for an enterprise environment, as anyone can perform any action and access all the data using this approach.
To enable Kerberos authentication, the following configurations need to be added to the Hadoop configuration files (hdfs-site.xml and core-site.xml):
- dfs.web.authentication.kerberos.principal: The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. The HTTP Kerberos principal MUST start with ‘HTTP/’ per Kerberos HTTP SPNEGO specification. A value of “*” will use all HTTP principals found in the keytab.
- dfs.web.authentication.kerberos.keytab: The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
- hadoop.http.authentication.type: Defines the authentication used for the HTTP web-consoles. The supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#. You can customize the authentication handler, which provides the opportunity to use other middleware/tools to perform authentication, for example Apache Ranger.
When security is enabled, all requests need to use a client that supports Kerberos HTTP SPNEGO authentication.
For the curl command line, refer to this page for more details about how to initialize the Kerberos configuration and use it for requests.
For programming languages/frameworks, there are usually packages available to perform HTTP SPNEGO authentication. For example, the requests-gssapi package can be used in Python.
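For illustration, here is a hedged sketch of such a request in Python with requests and requests-gssapi; it assumes a valid Kerberos ticket already exists in your credential cache (for example, obtained via kinit):

# Sketch: call WebHDFS with Kerberos HTTP SPNEGO authentication.
# Assumptions: requests and requests-gssapi are installed, and kinit has
# already been run for a principal allowed to access the path.
import requests
from requests_gssapi import HTTPSPNEGOAuth

resp = requests.get(
    "http://localhost:9870/webhdfs/v1/test",
    params={"op": "LISTSTATUS"},
    auth=HTTPSPNEGOAuth(),
)
print(resp.status_code)
print(resp.json())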
You can also use a delegation token when security is enabled. A delegation token can be retrieved using the following command:
curl -i "http://<HOST>:<PORT>/webhdfs/v1/?op=GETDELEGATIONTOKEN[&renewer=<USER>][&service=<SERVICE>][&kind=<KIND>]"
Ingest file into HDFS
The CREATE operation can be used to upload a file into HDFS. Two steps are required:
1) Get the data node location
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"
In my environment, I run the following command to find the data node location:
curl -i -X PUT "http://localhost:9870/webhdfs/v1/test/test.csv?op=CREATE&overwrite=true&noredirect=false"
The command returns the data node location in the Location header of the 307 redirect response; with noredirect=true, the location is returned in the JSON response body instead.
2) Ingest into the data node using the returned location
curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
For this example, the command line is:
curl -i -X PUT -T test.csv "http://localhost:9864/webhdfs/v1/test/test.csv?op=CREATE&namenoderpcaddress=0.0.0.0:19000&createflag=&createparent=true&overwrite=true&user.name=***"
The output looks like the following:
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Location: hdfs://0.0.0.0:19000/test/test.csv
Content-Length: 0
Access-Control-Allow-Origin: *
Connection: close
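The same two-step upload can also be done programmatically. Below is a hedged sketch in Python with requests (the host, path, file name, and user name are placeholders for your environment): step 1 asks the name node for the data node location without following the redirect, and step 2 sends the file content to that location:

# Sketch: two-step WebHDFS file upload.
# Assumptions: requests is installed, security is off, and user.name is accepted.
import requests

namenode = "http://localhost:9870"
hdfs_path = "/test/test.csv"

# Step 1: the name node replies with a 307 redirect; the Location header
# points at the data node that will receive the file content.
step1 = requests.put(
    namenode + "/webhdfs/v1" + hdfs_path,
    params={"op": "CREATE", "overwrite": "true", "user.name": "your_user"},
    allow_redirects=False,
)
datanode_url = step1.headers["Location"]

# Step 2: PUT the file bytes to the data node URL; 201 Created indicates success.
with open("test.csv", "rb") as f:
    step2 = requests.put(datanode_url, data=f)
print(step2.status_code)  # expect 201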
Reference
Refer to the official WebHDFS REST API documentation for a complete guide to the available APIs.