
Context

SQL Server Integration Services (SSIS) provides control flow tasks that perform operations against Hadoop, for example:

  • Hadoop File System Task
  • Hadoop Hive Task
  • Hadoop Pig Task

In a Data Flow Task, you can also use:

  • Hadoop HDFS Source
  • Hadoop HDFS Destination

On this page, I'm going to demonstrate how to write a file into HDFS using the SSIS Hadoop File System Task.

References

https://docs.microsoft.com/en-us/sql/integration-services/control-flow/hadoop-file-system-task

Prerequisites

Hadoop

Refer to the following page to install Hadoop if you don't have an instance to play with.

Install Hadoop 3.0.0 in Windows (Single Node)

SSIS

SSIS can be installed via SQL Server Data Tools (SSDT). In this example, I am using version 15.1.

Create Hadoop connection manager

In your SSIS package, create a Hadoop Connection Manager.

In the WebHDFS tab of the editor, specify the WebHDFS host, port, authentication type and user name for your cluster.
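For reference, a local single-node Hadoop 3.x instance with simple (Basic) authentication might use values like the following. These are assumed sample values; the host, port and user name depend entirely on your environment (9870 is the default NameNode HTTP port in Hadoop 3.x):

WebHDFS Host: localhost
WebHDFS Port: 9870
Authentication: Basic
WebHDFS User: hadoop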

Click the Test Connection button to ensure you can connect, and then click OK.

Create a file connection manager

Create a local CSV file

Create a local CSV file named F:\DataAnalytics\Sales.csv with the following content:

Month,Amount
1/01/2017,30022
1/02/2017,12334
1/03/2017,33455
1/04/2017,50000
1/05/2017,33333
1/06/2017,11344
1/07/2017,12344
1/08/2017,24556
1/09/2017,46667

Create a file connection manager

Create a file connection manager named Sales.csv which points to the file created above.

Create Hadoop File System Task

Use the two connection managers created above to configure a Hadoop File System Task: select the Hadoop connection manager, set the operation to CopyToHDFS, choose the Sales.csv file connection manager as the source, and enter /Sales.csv as the Hadoop file path.

With these settings, the task uploads the local Sales.csv file to /Sales.csv in HDFS.
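The task is roughly equivalent to running the following HDFS shell command from the Hadoop client machine:

hdfs dfs -put F:\DataAnalytics\Sales.csv /Sales.csv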

Run the package

Run the package or execute the task and make sure it completes successfully.

Verify the result via HDFS CLI

Use the following command to verify that the file was uploaded successfully:

hdfs dfs -ls /

You can also print the file content with the following command:

hdfs dfs -cat /Sales.csv

Verify the result through the NameNode web UI

You can also confirm the upload in the NameNode web UI (http://localhost:9870 by default for a local Hadoop 3.x instance) by opening Utilities > Browse the file system and checking that Sales.csv appears under the root directory.

WebHDFS REST API reference

    https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
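Since the Hadoop connection manager communicates with the cluster through WebHDFS, you can also verify the upload with plain REST calls. A minimal sketch, assuming the NameNode web interface is listening on localhost:9870:

curl -i "http://localhost:9870/webhdfs/v1/Sales.csv?op=GETFILESTATUS"

curl -L "http://localhost:9870/webhdfs/v1/Sales.csv?op=OPEN"

The first call returns the file status as JSON; the second follows the redirect to a DataNode and prints the file content.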

Summary

It is very easy to upload files into HDFS through SSIS. You can also upload a whole directory with this task by pointing the file connection manager to a folder instead of a single file.

If you have any questions, please let me know.
