Load File into HDFS through WebHDFS APIs
To ingest data into HDFS, one commonly used approach is to upload files into a temporary folder on one of the edge servers of the Hadoop cluster, where the HDFS CLI is available to copy files from the local file system to the distributed file system.
In the past, I've published several related articles about WebHDFS.
This article expands further on WebHDFS. It uses the curl command to call these APIs; you can also implement these web requests in your preferred programming language, such as Python, Java, C#, or Go.
Hadoop environment
To test these APIs, I am using a Hadoop single-node cluster set up based on the following guide:
Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
For this environment, the host is localhost and the name node port is 9870.
Create a directory in HDFS
Let's first create a directory to see how these RESTful web APIs work.
The syntax of calling this API is:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]"
Replace the parameters based on your environment and then run the command in a terminal that supports the curl command.
In my environment, I run the following command:
curl -i -X PUT "http://localhost:9870/webhdfs/v1/test?op=MKDIRS&permission=755"
The command returned a permission error.
Because I have not performed any authentication, the default web user dr.who is used, and the request is rejected:
Permission denied: user=dr.who, access=WRITE
To fix this issue, we can add the request user name to the URL:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS&user.name=<USER>[&permission=<OCTAL>]"
With the user name added, the request succeeds:
HTTP/1.1 200 OK
Date: Sat, 22 Aug 2020 11:42:41 GMT
Cache-Control: no-cache
Expires: Sat, 22 Aug 2020 11:42:41 GMT
Date: Sat, 22 Aug 2020 11:42:41 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Set-Cookie: hadoop.auth="u=***&p=***&t=simple&e=1598132561905&s=1Y3XzIZ7zBNyzsWECIT27gae+IDcUvpvEhwaScJRE48="; Path=/; HttpOnly
Content-Type: application/json
Transfer-Encoding: chunked
As part of the response, it also sets a client cookie (hadoop.auth) for Hadoop authentication.
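The same MKDIRS call can also be made programmatically. Below is a minimal sketch using the Python requests package (an assumed dependency, not part of the original walkthrough); the host, path, and user name are placeholders for your environment:

# Minimal sketch: call WebHDFS MKDIRS with user.name via Python requests.
# Assumptions: requests is installed and the cluster runs without Kerberos.
import requests

resp = requests.put(
    "http://localhost:9870/webhdfs/v1/test",
    params={"op": "MKDIRS", "permission": "755", "user.name": "your_user"},
)
print(resp.status_code)  # 200 on success
print(resp.json())       # {"boolean": true} when the directory is created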
Hadoop security mode
In my Hadoop cluster, security mode is turned off, which is why I can directly call the API with a specified user. This is not secure for an enterprise environment, as anyone can perform any action and access all the data using this approach.
To enable Kerberos authentication, the following configurations need to be added to the Hadoop configuration files (hdfs-site.xml and core-site.xml):
- dfs.web.authentication.kerberos.principal: The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. The HTTP Kerberos principal MUST start with ‘HTTP/’ per Kerberos HTTP SPNEGO specification. A value of “*” will use all HTTP principals found in the keytab.
- dfs.web.authentication.kerberos.keytab: The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
- hadoop.http.authentication.type: Defines the authentication used for the HTTP web-consoles. The supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#. You can customize the authentication handler, which provides the opportunity to use other middleware/tools to perform authentication, for example Apache Ranger.
When security is enabled, all requests need to use a client that supports Kerberos HTTP SPNEGO authentication.
For the curl command line, refer to this page for more details about how to initialize the Kerberos configuration and use it for requests.
For programming languages/frameworks, there are usually packages available to perform HTTP SPNEGO authentication. For example, the requests-gssapi package can be used in Python.
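For illustration, here is a hedged sketch of such a request in Python with requests and requests-gssapi; it assumes a valid Kerberos ticket already exists in your credential cache (for example, obtained via kinit):

# Sketch: call WebHDFS with Kerberos HTTP SPNEGO authentication.
# Assumptions: requests and requests-gssapi are installed, and kinit has
# already been run for a principal allowed to access the path.
import requests
from requests_gssapi import HTTPSPNEGOAuth

resp = requests.get(
    "http://localhost:9870/webhdfs/v1/test",
    params={"op": "LISTSTATUS"},
    auth=HTTPSPNEGOAuth(),
)
print(resp.status_code)
print(resp.json())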
You can also use a delegation token when security is enabled. A delegation token can be retrieved using the following command:
curl -i "http://<HOST>:<PORT>/webhdfs/v1/?op=GETDELEGATIONTOKEN[&renewer=<USER>][&service=<SERVICE>][&kind=<KIND>]"
Ingest file into HDFS
The CREATE operation can be used to upload a file into HDFS. Two steps are required:
1) Get the data node location
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"
In my environment, I run the following command to find the data node location:
curl -i -X PUT "http://localhost:9870/webhdfs/v1/test/test.csv?op=CREATE&overwrite=true&noredirect=false"
The command returns the data node location in the Location header of the 307 redirect response; with noredirect=true, the location is returned in the JSON response body instead.
2) Ingest into the data node using the returned location
curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
For this example, the command line is:
curl -i -X PUT -T test.csv "http://localhost:9864/webhdfs/v1/test/test.csv?op=CREATE&namenoderpcaddress=0.0.0.0:19000&createflag=&createparent=true&overwrite=true&user.name=***"
The output looks like the following:
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Location: hdfs://0.0.0.0:19000/test/test.csv
Content-Length: 0
Access-Control-Allow-Origin: *
Connection: close
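The same two-step upload can also be done programmatically. Below is a hedged sketch in Python with requests (the host, path, file name, and user name are placeholders for your environment): step 1 asks the name node for the data node location without following the redirect, and step 2 sends the file content to that location:

# Sketch: two-step WebHDFS file upload.
# Assumptions: requests is installed, security is off, and user.name is accepted.
import requests

namenode = "http://localhost:9870"
hdfs_path = "/test/test.csv"

# Step 1: the name node replies with a 307 redirect; the Location header
# points at the data node that will receive the file content.
step1 = requests.put(
    namenode + "/webhdfs/v1" + hdfs_path,
    params={"op": "CREATE", "overwrite": "true", "user.name": "your_user"},
    allow_redirects=False,
)
datanode_url = step1.headers["Location"]

# Step 2: PUT the file bytes to the data node URL; 201 Created indicates success.
with open("test.csv", "rb") as f:
    step2 = requests.put(datanode_url, data=f)
print(step2.status_code)  # expect 201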
Reference
Refer to the official WebHDFS REST API documentation for a complete guide to the available APIs.