Ingest Data into HDFS from NAS or Windows Shared Folder


Network Attached Storage (NAS) devices are commonly used in enterprises to store files on remote servers. They typically provide access to files through network file sharing protocols such as NFS, SMB, or AFP. In some cases, you may want to ingest this data into Hadoop HDFS from a NAS folder. This page describes some approaches for ingesting data from NAS into HDFS.

In the following solutions, there are two parts involved:
1) An intermediate server (which can be a Hadoop cluster edge server) is used to access the NAS or Windows shared folder.
2) HDFS CLI or WebHDFS APIs are used to ingest data into HDFS.

Mount NAS or shared folder as native drive

If your Hadoop cluster is deployed on Windows servers, you can map the NAS or shared folder as a native drive through the 'Map network drive' wizard on any of your cluster edge servers.

If your Hadoop cluster is installed on Linux servers, you can use the mount command. Its help output lists the available options:

mount -h

 mount [-lhV]
 mount -a [options]
 mount [options] [--source] <source> | [--target] <directory>
 mount [options] <source> <directory>
 mount <operation> <mountpoint> [<target>]

Mount a filesystem.

 -a, --all               mount all filesystems mentioned in fstab
 -c, --no-canonicalize   don't canonicalize paths
 -f, --fake              dry run; skip the mount(2) syscall
 -F, --fork              fork off for each device (use with -a)
 -T, --fstab <path>      alternative file to /etc/fstab
 -i, --internal-only     don't call the mount.<type> helpers
 -l, --show-labels       show also filesystem labels
 -n, --no-mtab           don't write to /etc/mtab
 -o, --options <list>    comma-separated list of mount options
 -O, --test-opts <list>  limit the set of filesystems (use with -a)
 -r, --read-only         mount the filesystem read-only (same as -o ro)
 -t, --types <list>      limit the set of filesystem types
     --source <src>      explicitly specifies source (path, label, uuid)
     --target <target>   explicitly specifies mountpoint
 -v, --verbose           say what is being done
 -w, --rw, --read-write  mount the filesystem read-write (default)

 -h, --help              display this help
 -V, --version           display version

 -L, --label <label>     synonym for LABEL=<label>
 -U, --uuid <uuid>       synonym for UUID=<uuid>
 LABEL=<label>           specifies device by filesystem label
 UUID=<uuid>             specifies device by filesystem UUID
 PARTLABEL=<label>       specifies device by partition label
 PARTUUID=<uuid>         specifies device by partition UUID
 <device>                specifies device by path
 <directory>             mountpoint for bind mounts (see --bind/rbind)
 <file>                  regular file for loopdev setup

 -B, --bind              mount a subtree somewhere else (same as -o bind)
 -M, --move              move a subtree to some other place
 -R, --rbind             mount a subtree and all submounts somewhere else
 --make-shared           mark a subtree as shared
 --make-slave            mark a subtree as slave
 --make-private          mark a subtree as private
 --make-unbindable       mark a subtree as unbindable
 --make-rshared          recursively mark a whole subtree as shared
 --make-rslave           recursively mark a whole subtree as slave
 --make-rprivate         recursively mark a whole subtree as private
 --make-runbindable      recursively mark a whole subtree as unbindable

For more details see mount(8).
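
The help output lists the options but not a concrete invocation. As a minimal sketch, the Python helper below assembles a typical mount command line for an NFS export and for an SMB/CIFS share. The server name nas01, the export and share paths, the mount points and the credentials file are placeholders, not values from this article. It also assumes that the cifs helper (usually from the cifs-utils package) is installed and that the printed commands are run as root.

```python
import shlex


def build_mount_command(fs_type, source, mount_point, options=None):
    """Return the argument list for `mount -t <type> [-o opts] <source> <dir>`."""
    cmd = ["mount", "-t", fs_type]
    if options:
        # -o takes a comma-separated list of mount options.
        cmd += ["-o", ",".join(options)]
    cmd += [source, mount_point]
    return cmd


# NFS export (hypothetical server and paths), mounted read-only.
nfs_cmd = build_mount_command("nfs", "nas01:/export/data", "/mnt/nas", ["ro"])

# Windows/SMB share (hypothetical); credentials=<file> keeps the password
# out of the command line.
cifs_cmd = build_mount_command(
    "cifs",
    "//nas01/share",
    "/mnt/share",
    ["credentials=/etc/nas-credentials", "ro"],
)

if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    print(shlex.join(nfs_cmd))
    print(shlex.join(cifs_cmd))
```

Mounting read-only is a deliberate choice here: the ingestion only needs to read from the share, so there is no reason to risk modifying the source files.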

Ingest data using hadoop fs -copyFromLocal

Once the network drive is mounted or mapped, you can use the hadoop fs -copyFromLocal command to ingest data into HDFS.

# Linux
hadoop fs -copyFromLocal /mnt/path/to/file /hdfs/path
# Windows
hadoop fs -copyFromLocal /Z/path/to/file /hdfs/path
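
When a whole mounted folder has to be ingested, the same command can be scripted. The following is a rough sketch rather than a finished tool: it builds one hadoop fs -copyFromLocal command per regular file in a local directory and only runs them when asked to. The function names and the dry_run switch are made up for this illustration, and it assumes the hadoop executable is on the PATH of the edge server.

```python
import subprocess
from pathlib import Path


def build_ingest_commands(local_dir, hdfs_dir):
    """Build one `hadoop fs -copyFromLocal` command per regular file."""
    commands = []
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file():  # subdirectories are skipped in this sketch
            commands.append(
                ["hadoop", "fs", "-copyFromLocal", str(path), hdfs_dir]
            )
    return commands


def ingest(local_dir, hdfs_dir, dry_run=True):
    """Print the commands, and run them when dry_run is False."""
    for cmd in build_ingest_commands(local_dir, hdfs_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if hadoop exits non-zero.
            subprocess.run(cmd, check=True)
```

For example, ingest("/mnt/path/to", "/hdfs/path") prints the commands for review, and passing dry_run=False executes them.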

SCP/SFTP to upload file from a proxy server to Hadoop edge server

Another approach is to use the SFTP or SCP protocol on an intermediate server that has access to the network drives to upload the files to the Hadoop edge server.

This can be done through command line interfaces or programming packages.

For CLIs or client tools, refer to SFTP or SCP for more details.

Example command

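# The batch file contains sftp commands such as: put /path/to/local/file /path/to/edge/inbound
sftp -b /path/to/batch/file user@your-edge-server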
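
The -b option of sftp reads its commands from a batch file, so the list of files to upload has to be written into that file first. As an illustration only, the sketch below generates such a batch file with one put command per local file. The function name, the batch file location and the remote directory are assumptions made for this example.

```python
from pathlib import Path


def write_sftp_batch(local_files, remote_dir, batch_path):
    """Write an sftp batch file with one `put` command per local file."""
    # Paths are double-quoted so that names containing spaces still work.
    lines = ['put "{}" "{}"'.format(f, remote_dir) for f in local_files]
    Path(batch_path).write_text("\n".join(lines) + "\n")
    return lines
```

After calling write_sftp_batch(["/mnt/path/to/file"], "/path/to/edge/inbound", "/tmp/upload.batch"), the upload can be started with sftp -b /tmp/upload.batch followed by the user and host of the edge server.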

sftp/scp packages

If you code in Python, you can use the pysftp or scp packages to upload files from an intermediate server to the Hadoop edge server.

from paramiko import SSHClient
from scp import SCPClient

# The scp package works on top of a paramiko SSH connection.
ssh = SSHClient()
ssh.load_system_host_keys()

# Connect using an SSH key file
ssh.connect('your-edge-server', username='user', key_filename='/path/to/ssh_keyfile')
# or connect using the keys found by the SSH agent or in ~/.ssh
# ssh.connect('your-edge-server', username='user')
# or connect using user name and password
# ssh.connect('your-edge-server', username='user', password='password')

# and then upload the file
with SCPClient(ssh.get_transport()) as client:
    client.put('/path/to/local/file', '/path/to/edge/inbound')
ssh.close()

Similar client libraries are also available for most other languages and frameworks, such as .NET and Java.

Ingest data using hadoop fs -copyFromLocal

Once data is copied to your cluster edge server, you can use the hadoop fs command to copy it from the local file system to HDFS, as shown in the first approach.

Utilize WebHDFS API

In the above approaches, local/native HDFS CLIs are used to ingest data. These approaches require the data to be transferred to the edge server, or the share to be mapped/mounted, first. A different approach is to use the WebHDFS APIs directly.


You can find more details about WebHDFS API on the official documentation page.

In short, the HTTP REST API supports the complete FileSystem/FileContext interface for HDFS, so HTTP requests can be used to ingest data directly into HDFS. The HTTP Query Parameter Dictionary in the official documentation specifies the parameter details for each operation.

CREATE a file

You can use the CREATE operation to write a file. The following is the syntax for calling this API with curl. You can also use any other language that supports HTTP calls to invoke the APIs.

Step 1 Call using PUT HTTP method

This API call returns the address of the datanode where the data will be written.

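curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
                    [&overwrite=<true |false>][&blocksize=<LONG>][&replication=<SHORT>]
                    [&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"

By default, the request is redirected to a datanode:

HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0

If the noredirect flag is set, the location is returned in a JSON body instead:

HTTP/1.1 200 OK
Content-Type: application/json
{"Location":"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."}

Step 2 Call API on data node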

Use the location from the Location header or from the response JSON body (depending on whether the request was redirected) to upload the local file to the datanode.

curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."

The response looks like the following:

HTTP/1.1 201 Created
Location: webhdfs://<HOST>:<PORT>/<PATH>
Content-Length: 0
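
The two steps can also be scripted. The following is a minimal sketch using only the Python standard library, not a definitive client. It assumes a cluster without security enabled, a Hadoop version that supports the noredirect parameter, and plain HTTP; the function names and the host, port and path values passed to them are placeholders chosen for this example. The noredirect=true parameter is used because urllib does not follow a 307 redirect for PUT requests, so the datanode location is read from the JSON body instead.

```python
import json
import os
import urllib.request
from urllib.parse import quote, urlencode


def build_create_url(host, port, hdfs_path, user=None, overwrite=False):
    """Build the step 1 URL; noredirect=true asks for the location as JSON."""
    params = {
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "noredirect": "true",
    }
    if user:
        # Only meaningful when security is off.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1/{}?{}".format(
        host, port, quote(hdfs_path.lstrip("/")), urlencode(params)
    )


def upload_file(host, port, local_file, hdfs_path, user=None, overwrite=False):
    """Two-step CREATE: ask the namenode for a location, then PUT the data."""
    # Step 1: PUT without data; the JSON body carries the datanode URL.
    step1 = urllib.request.Request(
        build_create_url(host, port, hdfs_path, user, overwrite), method="PUT"
    )
    with urllib.request.urlopen(step1) as resp:
        location = json.load(resp)["Location"]

    # Step 2: PUT the file content to the datanode URL.
    with open(local_file, "rb") as data:
        step2 = urllib.request.Request(
            location,
            data=data,
            method="PUT",
            headers={
                "Content-Type": "application/octet-stream",
                # An explicit length avoids chunked transfer encoding.
                "Content-Length": str(os.path.getsize(local_file)),
            },
        )
        with urllib.request.urlopen(step2) as resp:
            return resp.status  # 201 is expected when the file is created
```

Because the upload only needs HTTP access to the namenode and datanodes, a script like this can run on the intermediate server that mounts the NAS, without a local Hadoop installation.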

About authentication

You may wonder how authentication works. Refer to the Authentication section of the official documentation page for more details. To summarize, when security is off, the authenticated user is the username specified in the user.name query parameter. When security is on, authentication is performed by either Hadoop delegation token or Kerberos SPNEGO.
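
As a small illustration of how these modes show up in a request, the sketch below appends either the user.name parameter (security off) or the delegation parameter (security on, with a token obtained beforehand) to a WebHDFS URL. The helper names are invented for this example. Kerberos SPNEGO is not covered here, since it is negotiated by the HTTP client rather than passed as a query parameter.

```python
from urllib.parse import urlencode


def auth_params(user_name=None, delegation_token=None):
    """Return the WebHDFS query parameters for the chosen authentication mode."""
    if user_name and delegation_token:
        raise ValueError("pass either user_name or delegation_token, not both")
    if delegation_token:
        # Security on: a Hadoop delegation token obtained beforehand.
        return {"delegation": delegation_token}
    if user_name:
        # Security off: the server takes the user.name value as given.
        return {"user.name": user_name}
    # No parameter: SPNEGO, if used, is handled by the HTTP client itself.
    return {}


def with_auth(base_url, **kwargs):
    """Append the authentication parameters to a URL that already has a query."""
    params = auth_params(**kwargs)
    return base_url + ("&" + urlencode(params) if params else "")
```

For example, with_auth(url, user_name="hadoop") adds &user.name=hadoop to a URL that already ends in ?op=CREATE.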

Other approaches

Do you use other approaches to ingest data into HDFS from a Windows shared folder or NAS? If so, feel free to share your ideas in the comments area.

Last modified by Raymond 2 months ago.