Ingest Data into HDFS from NAS or Windows Shared Folder
- Mount NAS or shared folder as native drive
  - Ingest data using hadoop fs -copyFromLocal
- SCP/SFTP to upload files from a proxy server to Hadoop edge server
  - Example command
  - sftp/scp packages
  - Ingest data using hadoop fs -copyFromLocal
- Utilize WebHDFS API
  - About WebHDFS REST API
  - CREATE a file
  - About authentication
- Other approaches
Network Attached Storage (NAS) is commonly used in many enterprises, where files are stored remotely on dedicated servers. These servers typically provide access to files through network file sharing protocols such as NFS, SMB, or AFP. In some cases, you may want to ingest this data from a NAS folder into Hadoop HDFS. This page provides some approaches for ingesting data from NAS to HDFS. In general, the approaches share two characteristics:
1) An intermediate server (which can be a Hadoop cluster edge server) is used to access the NAS or Windows shared folder.
2) The HDFS CLI or the WebHDFS REST API is then used to ingest the data into HDFS.
Mount NAS or shared folder as native drive
If your Hadoop cluster is deployed on Windows servers, you can easily map the NAS or shared folder as a native drive through the 'Map network drive' wizard on any of your cluster edge servers.
If your Hadoop cluster is installed on Linux servers, you can use the mount command. Its usage is shown below (output of mount -h):
mount -h
Usage:
 mount [-lhV]
 mount -a [options]
 mount [options] [--source] <source> | [--target] <directory>
 mount [options] <source> <directory>
 mount <operation> <mountpoint> [<target>]

Mount a filesystem.

Options:
 -a, --all               mount all filesystems mentioned in fstab
 -c, --no-canonicalize   don't canonicalize paths
 -f, --fake              dry run; skip the mount(2) syscall
 -F, --fork              fork off for each device (use with -a)
 -T, --fstab <path>      alternative file to /etc/fstab
 -i, --internal-only     don't call the mount.<type> helpers
 -l, --show-labels       show also filesystem labels
 -n, --no-mtab           don't write to /etc/mtab
 -o, --options <list>    comma-separated list of mount options
 -O, --test-opts <list>  limit the set of filesystems (use with -a)
 -r, --read-only         mount the filesystem read-only (same as -o ro)
 -t, --types <list>      limit the set of filesystem types
     --source <src>      explicitly specifies source (path, label, uuid)
     --target <target>   explicitly specifies mountpoint
 -v, --verbose           say what is being done
 -w, --rw, --read-write  mount the filesystem read-write (default)

 -h, --help     display this help
 -V, --version  display version

Source:
 -L, --label <label>     synonym for LABEL=<label>
 -U, --uuid <uuid>       synonym for UUID=<uuid>
 LABEL=<label>           specifies device by filesystem label
 UUID=<uuid>             specifies device by filesystem UUID
 PARTLABEL=<label>       specifies device by partition label
 PARTUUID=<uuid>         specifies device by partition UUID
 <device>                specifies device by path
 <directory>             mountpoint for bind mounts (see --bind/rbind)
 <file>                  regular file for loopdev setup

Operations:
 -B, --bind              mount a subtree somewhere else (same as -o bind)
 -M, --move              move a subtree to some other place
 -R, --rbind             mount a subtree and all submounts somewhere else
 --make-shared           mark a subtree as shared
 --make-slave            mark a subtree as slave
 --make-private          mark a subtree as private
 --make-unbindable       mark a subtree as unbindable
 --make-rshared          recursively mark a whole subtree as shared
 --make-rslave           recursively mark a whole subtree as slave
 --make-rprivate         recursively mark a whole subtree as private
 --make-runbindable      recursively mark a whole subtree as unbindable

For more details see mount(8).
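For example, an NFS export or a Windows/SMB share can be mounted as shown below. The server names, share names, and mount points are placeholders, and the CIFS mount assumes the cifs-utils package is installed and valid credentials are supplied:

# Mount an NFS export from the NAS (placeholder host and export path)
sudo mkdir -p /mnt/nas
sudo mount -t nfs nas-server.example.com:/export/data /mnt/nas

# Mount a Windows/SMB shared folder (requires cifs-utils)
sudo mount -t cifs //fileserver.example.com/shared /mnt/nas -o username=myuser,password=mypassword,vers=3.0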
Ingest data using hadoop fs -copyFromLocal
Once you mount or map the network drive, you can use the hadoop fs -copyFromLocal command to ingest data into HDFS.
# Linux
hadoop fs -copyFromLocal /mnt/path/to/file /hdfs/path

# Windows (assuming the share is mapped as drive Z:)
hadoop fs -copyFromLocal Z:\path\to\file /hdfs/path
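If the target file already exists, you can pass the -f option to overwrite it, and a quick hadoop fs -ls confirms the ingest (paths below are placeholders):

# Overwrite the target file if it already exists
hadoop fs -copyFromLocal -f /mnt/path/to/file /hdfs/path

# Verify the file landed in HDFS
hadoop fs -ls /hdfs/path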
SCP/SFTP to upload files from a proxy server to Hadoop edge server
Another approach is to use the SFTP or SCP protocol from an intermediate server that has access to the network drives to upload the files to a Hadoop edge server.
This can be done through command line interfaces or programming packages.
For CLIs or client tools, refer to SFTP or SCP for more details.
Example command
# Copy a local file to the edge server's inbound folder with scp
scp /path/to/local/file user@your-edge-server.com:/path/to/edge/inbound
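The sftp client can achieve the same in batch mode; the batch file name and paths below are placeholders, and batch mode assumes non-interactive (for example, key-based) authentication:

# Create a one-line batch file with the transfer command
echo "put /path/to/local/file /path/to/edge/inbound" > upload.sftp

# Run sftp non-interactively against the edge server
sftp -b upload.sftp user@your-edge-server.com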
sftp/scp packages
If you code in Python, you can use the pysftp or scp packages to upload files from the intermediate server to the Hadoop edge server. The following example uses the scp package (built on top of paramiko):
# Requires the paramiko and scp packages (pip install paramiko scp)
import paramiko
from scp import SCPClient

# Connect to the edge server over SSH using an SSH key file
ssh = paramiko.SSHClient()
ssh.load_system_host_keys()
ssh.connect('your-edge-server', username='user', key_filename='/path/to/ssh_keyfile')
# or connect with a user name and password instead
# ssh.connect('your-edge-server', username='user', password='password')

# and then transfer the local file to the edge server's inbound folder
with SCPClient(ssh.get_transport()) as client:
    client.put('/path/to/local/file', '/path/to/edge/inbound')
Similar client libraries are available in most other languages and frameworks, such as .NET and Java.
Ingest data using hadoop fs -copyFromLocal
Once the files have been uploaded to the edge server, use the same hadoop fs -copyFromLocal command shown earlier to ingest them from the edge server's local file system into HDFS.
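For example, reusing the inbound folder from the scp example above (all paths are placeholders):

hadoop fs -copyFromLocal /path/to/edge/inbound/file /hdfs/path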
Utilize WebHDFS API
In the above approaches, the local/native HDFS CLI is used to ingest data, which requires the data to be transferred to an edge server or mapped/mounted first. A different approach is to use the WebHDFS REST API directly.
About WebHDFS REST API
You can find more details about the WebHDFS API on the official documentation page.
Long story short, the HTTP REST API supports the complete FileSystem/FileContext interface for HDFS, so we can use HTTP requests to ingest data directly into HDFS. The HTTP Query Parameter Dictionary specifies the parameter details for each operation.
CREATE a file
You can use the CREATE operation to write a file. The following is the syntax for calling this API with curl; you can use any other language that supports HTTP calls to invoke the API too.
Step 1: Call the namenode using the HTTP PUT method
This call returns the address of the datanode that the data will be written to.
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
                    [&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>]
                    [&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"
Response (when noredirect=true, the datanode location is returned in the JSON body; otherwise the namenode returns a 307 redirect with a Location header):
HTTP/1.1 200 OK
Content-Type: application/json

{"Location":"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."}
Step 2: Call the API on the datanode
Use the location from the Location header or the response JSON body (depending on whether the request was redirected) to PUT the local file to the datanode.
curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
The response looks like the following:
HTTP/1.1 201 Created
Location: webhdfs://<HOST>:<PORT>/<PATH>
Content-Length: 0
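Putting the two steps together, a minimal sketch could look like the following. The namenode address namenode.example.com:9870 (the default WebHDFS port in Hadoop 3; older releases use 50070), the HDFS path, and the local file are placeholders, and security is assumed to be off so user.name is used for authentication:

# Step 1: ask the namenode where to write; noredirect=true returns the datanode
# location in the JSON body instead of a 307 redirect
curl -i -X PUT "http://namenode.example.com:9870/webhdfs/v1/data/inbound/file.csv?op=CREATE&overwrite=true&noredirect=true&user.name=hadoop"

# Step 2: upload the local file to the datanode address returned in step 1
curl -i -X PUT -T /mnt/nas/file.csv "http://<DATANODE>:<PORT>/webhdfs/v1/data/inbound/file.csv?op=CREATE&..."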
About authentication
One question you may ask is about authentication. Refer to the Authentication section on the official documentation page for more details. To summarize, when security is off, the authenticated user is the username specified in the user.name query parameter; when security is on, authentication is performed by either a Hadoop delegation token or Kerberos SPNEGO.
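For example (host, port, and path are placeholders): when security is off, append the user.name parameter; when Kerberos is enabled, curl can perform the SPNEGO negotiation using a ticket obtained from kinit:

# Simple authentication (security off): pass the user name as a query parameter
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE&user.name=hadoop"

# Kerberos SPNEGO (security on): let curl negotiate with the current Kerberos ticket
curl -i --negotiate -u : -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"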
Other approaches
Do you use other approaches to ingest data into HDFS from a Windows shared folder or NAS? If so, feel free to share your ideas in the comments area.