Ingest Data into HDFS from NAS or Windows Shared Folder
- Mount NAS or shared folder as native drive
- Ingest data using hadoop fs -copyFromLocal
- SCP/SFTP to upload file from a proxy server to Hadoop edge server
- Example command
- sftp/scp packages
- Ingest data using hadoop fs -copyFromLocal
- Utilize WebHDFS API
- About WebHDFS REST API
- CREATE a file
- About authentication
- Other approaches
Network Attached Storage (NAS) devices are commonly used in enterprises to store files remotely. They typically provide access to files through network file sharing protocols such as NFS, SMB, or AFP. In some cases, you may want to ingest this data into Hadoop HDFS from a NAS folder. This page provides some thoughts on how to ingest data from NAS into HDFS. The general pattern is:
1) An intermediate server (can be Hadoop cluster edge server) is used to access NAS or Windows shared folder.
2) HDFS CLI or WebHDFS APIs are used to ingest data into HDFS.
Mount NAS or shared folder as native drive
If your Hadoop cluster is deployed on Windows servers, you can map the NAS or shared folder as a native drive through the 'Map network drive' wizard on any of your cluster edge servers.
If your Hadoop cluster is installed on Linux servers, you can use the mount command. Its built-in help lists the available options:
mount -h

Usage:
 mount [-lhV]
 mount -a [options]
 mount [options] [--source] <source> | [--target] <directory>
 mount [options] <source> <directory>
 mount <operation> <mountpoint> [<target>]

Mount a filesystem.

Options:
 -a, --all               mount all filesystems mentioned in fstab
 -c, --no-canonicalize   don't canonicalize paths
 -f, --fake              dry run; skip the mount(2) syscall
 -F, --fork              fork off for each device (use with -a)
 -T, --fstab <path>      alternative file to /etc/fstab
 -i, --internal-only     don't call the mount.<type> helpers
 -l, --show-labels       show also filesystem labels
 -n, --no-mtab           don't write to /etc/mtab
 -o, --options <list>    comma-separated list of mount options
 -O, --test-opts <list>  limit the set of filesystems (use with -a)
 -r, --read-only         mount the filesystem read-only (same as -o ro)
 -t, --types <list>      limit the set of filesystem types
     --source <src>      explicitly specifies source (path, label, uuid)
     --target <target>   explicitly specifies mountpoint
 -v, --verbose           say what is being done
 -w, --rw, --read-write  mount the filesystem read-write (default)
 -h, --help              display this help
 -V, --version           display version

Source:
 -L, --label <label>     synonym for LABEL=<label>
 -U, --uuid <uuid>       synonym for UUID=<uuid>
 LABEL=<label>           specifies device by filesystem label
 UUID=<uuid>             specifies device by filesystem UUID
 PARTLABEL=<label>       specifies device by partition label
 PARTUUID=<uuid>         specifies device by partition UUID
 <device>                specifies device by path
 <directory>             mountpoint for bind mounts (see --bind/rbind)
 <file>                  regular file for loopdev setup

Operations:
 -B, --bind              mount a subtree somewhere else (same as -o bind)
 -M, --move              move a subtree to some other place
 -R, --rbind             mount a subtree and all submounts somewhere else
 --make-shared           mark a subtree as shared
 --make-slave            mark a subtree as slave
 --make-private          mark a subtree as private
 --make-unbindable       mark a subtree as unbindable
 --make-rshared          recursively mark a whole subtree as shared
 --make-rslave           recursively mark a whole subtree as slave
 --make-rprivate         recursively mark a whole subtree as private
 --make-runbindable      recursively mark a whole subtree as unbindable

For more details see mount(8).
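The help output lists the options but not a concrete invocation. As a rough sketch, the following Python helpers assemble typical mount commands for an NFS export and for an SMB/CIFS (Windows) share. The host name nas.example.com, the export and share names, the mount point /mnt/nas and the credentials file path are all hypothetical placeholders, and the helpers only build the command; running it normally requires root privileges and the NFS or CIFS client utilities installed on the edge server.

```python
import shlex


def nfs_mount_command(server, export, mountpoint, read_only=True):
    """Build a mount command for an NFS export."""
    cmd = ["mount", "-t", "nfs"]
    if read_only:
        # Read-only is enough when the share is only a source for ingestion.
        cmd += ["-o", "ro"]
    cmd += [f"{server}:{export}", mountpoint]
    return cmd


def cifs_mount_command(server, share, mountpoint, credentials_file, read_only=True):
    """Build a mount command for an SMB/CIFS (Windows) shared folder."""
    # The credentials file keeps the user name and password off the command line.
    options = [f"credentials={credentials_file}"]
    if read_only:
        options.append("ro")
    return ["mount", "-t", "cifs", "-o", ",".join(options),
            f"//{server}/{share}", mountpoint]


# Print the commands so they can be reviewed before being run as root.
print(shlex.join(nfs_mount_command("nas.example.com", "/export/data", "/mnt/nas")))
print(shlex.join(cifs_mount_command("nas.example.com", "share", "/mnt/nas",
                                    "/etc/nas-credentials")))
```

Mounting read-only is a deliberate choice in this sketch: the ingestion job only needs to read from the share, so it cannot modify the source files by accident.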
Ingest data using hadoop fs -copyFromLocal
Once you have mounted or mapped the network drive, you can use the hadoop fs -copyFromLocal command to ingest data into HDFS.
# Linux
hadoop fs -copyFromLocal /mnt/path/to/file /hdfs/path

# Windows
hadoop fs -copyFromLocal /Z/path/to/file /hdfs/path
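When many files need to be ingested, the same command can be scripted. The following is a minimal sketch, assuming the hadoop CLI is on the PATH of the edge server; the function names and the example paths are illustrative only. The -f flag of copyFromLocal overwrites the destination if it already exists.

```python
import os
import subprocess


def copy_from_local_commands(local_files, hdfs_dir, overwrite=False):
    """Build one 'hadoop fs -copyFromLocal' command per local file."""
    commands = []
    for local_path in local_files:
        cmd = ["hadoop", "fs", "-copyFromLocal"]
        if overwrite:
            cmd.append("-f")  # overwrite the destination if it exists
        target = f"{hdfs_dir.rstrip('/')}/{os.path.basename(local_path)}"
        cmd += [local_path, target]
        commands.append(cmd)
    return commands


def run_all(commands):
    """Run the commands one by one and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Nothing is executed here; the commands are only printed for review.
for cmd in copy_from_local_commands(["/mnt/nas/a.csv", "/mnt/nas/b.csv"], "/data/inbound"):
    print(" ".join(cmd))
```

Calling run_all on the returned list performs the actual ingestion on the edge server.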
SCP/SFTP to upload file from a proxy server to Hadoop edge server
Another approach is to use SFTP or SCP on an intermediate server that has access to the network drives to upload the files to the Hadoop edge server.
This can be done through command-line tools or programming libraries.
For CLIs or client tools, refer to the SFTP or SCP documentation for more details.
Example command
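# -b takes a batch file of sftp commands, for example one containing: put /path/to/local/file /path/to/edge/inbound
sftp -b /path/to/batchfile user@your-edge-server.com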
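Note that the -b option of sftp expects a batch file containing sftp commands (such as put), not the file to upload, and batch mode needs non-interactive authentication such as SSH keys. The sketch below generates such a batch file and the matching command line. The function names are hypothetical, and the paths are assumed to contain no whitespace, so no quoting is applied.

```python
import tempfile


def write_batch_file(local_files, remote_dir):
    """Write an sftp batch file with one 'put' command per local file."""
    lines = [f"put {path} {remote_dir}" for path in local_files]
    # delete=False keeps the file on disk so that sftp can read it afterwards.
    with tempfile.NamedTemporaryFile("w", suffix=".sftp", delete=False) as fh:
        fh.write("\n".join(lines) + "\n")
    return fh.name


def sftp_command(batch_file, user, host):
    """Build the sftp command that replays the batch file."""
    return ["sftp", "-b", batch_file, f"{user}@{host}"]


batch = write_batch_file(["/mnt/nas/a.csv", "/mnt/nas/b.csv"], "/path/to/edge/inbound")
print(" ".join(sftp_command(batch, "user", "your-edge-server.com")))
```

The returned command list can be passed to subprocess.run(..., check=True) on the intermediate server to perform the upload.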
sftp/scp packages
If you code in Python, you can use the pysftp or scp packages to upload files from an intermediate server to the Hadoop edge server.
import scp

# Create client using an SSH key file
client = scp.Client(host='your-edge-server', user='user', keyfile='/path/to/ssh_keyfile')

# or create client using system keys
client = scp.Client(host='your-edge-server', user='user')
client.use_system_keys()

# or create client using user name and password
client = scp.Client(host='your-edge-server', user='user', password='password')

# and then transfer the file to the edge server
client.transfer('/path/to/local/file', '/path/to/edge/inbound')
Similar client libraries are available for most other languages and frameworks, such as .NET and Java.
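Ingest data using hadoop fs -copyFromLocal
Once the files are on the edge server, use the hadoop fs -copyFromLocal command described earlier to ingest them into HDFS.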
Utilize WebHDFS API
In the approaches above, the local HDFS CLI is used to ingest data. They require the data to be transferred to the edge server, or the share to be mapped or mounted there, first. A different approach is to call the WebHDFS APIs directly.
About WebHDFS REST API
You can find more details about WebHDFS API on the official documentation page.
Long story short, the HTTP REST API supports the complete FileSystem/FileContext interface for HDFS, so we can use HTTP requests to ingest data directly into HDFS. The HTTP Query Parameter Dictionary in the official documentation specifies the parameter details for each operation.
CREATE a file
You can use the CREATE operation to write a file. The following is the syntax for calling this API with curl. You can also use any other language that supports HTTP calls to invoke the API.
Step 1 Call using PUT HTTP method
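This call does not send any file data. It returns the location of the datanode that the data should be written to. By default the namenode answers with 307 Temporary Redirect and puts the location in the Location header; with noredirect=true it answers 200 OK with the location in a JSON body, as in the response shown below.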
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
                    [&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>]
                    [&permission=<OCTAL>][&buffersize=<INT>][&noredirect=<true|false>]"
Response:
HTTP/1.1 200 OK
Content-Type: application/json
{"Location":"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."}
Step 2 Call API on data node
Use the location from the Location header or from the JSON response body (depending on whether noredirect was set) to upload the local file to the datanode.
curl -i -X PUT -T <LOCAL_FILE> "http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
The response looks like the following:
HTTP/1.1 201 Created
Location: webhdfs://<HOST>:<PORT>/<PATH>
Content-Length: 0
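The two steps above can be sketched in Python using only the standard library. This is an illustration under several assumptions rather than a definitive client: the cluster is not secured (so the user.name parameter is honoured), the namenode answers step 1 with the default 307 redirect (noredirect is not set), plain HTTP is used, and the host name, the port 9870 and the paths in the example are placeholders to replace with your own values. The function names are hypothetical.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def build_create_url(host, port, hdfs_path, user=None, overwrite=False):
    """Build the namenode URL for step 1 of the CREATE operation."""
    params = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if user:
        params["user.name"] = user  # only honoured when security is off
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


def webhdfs_create(host, port, hdfs_path, local_file, user=None, overwrite=False):
    """Upload local_file to hdfs_path and return the final HTTP status code."""
    # Step 1: ask the namenode where to write; no file data is sent here.
    url = urlsplit(build_create_url(host, port, hdfs_path, user, overwrite))
    conn = http.client.HTTPConnection(url.hostname, url.port)
    conn.request("PUT", f"{url.path}?{url.query}")
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError(f"unexpected namenode response: {resp.status}")

    # Step 2: send the file content to the datanode named in the Location header.
    target = urlsplit(location)
    conn = http.client.HTTPConnection(target.hostname, target.port)
    headers = {"Content-Length": str(os.path.getsize(local_file))}
    with open(local_file, "rb") as fh:
        conn.request("PUT", f"{target.path}?{target.query}", body=fh, headers=headers)
        resp = conn.getresponse()
        resp.read()
    conn.close()
    return resp.status  # 201 is expected on success


print(build_create_url("namenode.example.com", 9870, "/data/in/file.csv", user="hdfs"))
```

A call such as webhdfs_create("namenode.example.com", 9870, "/data/in/file.csv", "/mnt/nas/file.csv", user="hdfs") would then perform both steps. The redirect is handled by hand because the second request must carry the file body, which the first one deliberately omits.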
About authentication
One question you may have is how authentication works. Refer to the Authentication section of the official documentation page for more details. To summarize, when security is off, the authenticated user is the username specified in the user.name query parameter. When security is on, authentication is performed by either a Hadoop delegation token or Kerberos SPNEGO.
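For the two query-parameter-based cases, the difference is only in the URL. The helper below is a small sketch (the function name add_auth and the example host are hypothetical) that appends either user.name or a delegation token, passed in the delegation parameter, to an existing WebHDFS URL. Kerberos SPNEGO is not covered because it is negotiated through HTTP headers rather than query parameters; with curl it is enabled with --negotiate -u :.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_auth(url, user=None, delegation_token=None):
    """Append WebHDFS authentication query parameters to a URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if delegation_token:
        # Security on: pass a previously obtained Hadoop delegation token.
        query.append(("delegation", delegation_token))
    elif user:
        # Security off: the server trusts the user name given here.
        query.append(("user.name", user))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_auth("http://namenode.example.com:9870/webhdfs/v1/tmp/f?op=CREATE", user="hdfs"))
```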
Other approaches
Do you use other approaches to ingest data into HDFS from a Windows shared folder or NAS? If so, feel free to share your ideas in the comments area.