Use Hadoop File System Task in SSIS to Write File into HDFS
Context
SQL Server Integration Service (SSIS) has tasks to perform operations against Hadoop, for example:
- Hadoop File System Task
- Hadoop Hive Task
- Hadoop Pig Task
In Data Flow Task, you can also use:
- Hadoop HDFS Source
- Hadoop HDFS Destination
In this page, I’m going to demonstrate how to write file into HDFS through SSIS Hadoop File System Task.
References
https://docs.microsoft.com/en-us/sql/integration-services/control-flow/hadoop-file-system-task
Prerequisites
Hadoop
Refer to the following page to install Hadoop if you don’t have one instance to play with.
Install Hadoop 3.0.0 in Windows (Single Node)
SSIS
SSIS can be installed via SQL Server Data Tools (SSDT). In this example, I am using 15.1.
Create Hadoop connection manager
In your SSIS package, create a Hadoop Connection Manager:
In WebHDFS tab of the editor, specify the following details:
- WebHDFS Host: the host of your web HDFS. in this example, it is pointing to my local Hadoop instance.
- WebHDFS Port: the port of Name Node. It is the same port as your Name Node UI https://kontext.tech/docs/DataAndBusinessIntelligence/p/install-hadoop-300-in-windows-single-node#NameNode_UI
- Authentication: Basic (I am not using Kerberos)
- WebHDFS User: fahao
Click Test Connection button to ensure you can connect and then click OK:
Create a file connection manager
Create a local CSV file
Create a local CSV file named F:\DataAnalytics\Sales.csv with the following content:
Month,Amount
1/01/2017,30022
1/02/2017,12334
1/03/2017,33455
1/04/2017,50000
1/05/2017,33333
1/06/2017,11344
1/07/2017,12344
1/08/2017,24556
1/09/2017,46667
Create a file connection manager
Create a file connection manager Sales.csv which points to the file created above.
Create Hadoop File System Task
Use the two connection managers created above to create a Hadoop File System Task:
In the above settings, it uploads Sales.csv into /Sales.csv in HDFS.
Run the package
Run the package or execute the task to make sure it is completed successfully:
Verify the result via HDFS CLI
Use the following command to verify whether the file is uploaded successfully:
hdfs dfs -ls \
You can also print out the content via the following command:
hdfs dfs -cat /Sales.csv
Verify the result through Name Node web UI
WebHDFS REST API reference
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
Summary
It is very easy to upload files into HDFS through SSIS. You can also upload the whole directory into HDFS through this task if you change the file connection manager to pointing to a folder.
If you have any questions, please let me know.