Raymond Raymond | Sqoop

Configure Sqoop in a Edge Node of Hadoop Cluster

event 2018-05-12 visibility 5,572 comment 0 insights toc
insights Stats

This page continues with the following documentation about configuring a Hadoop multi-nodes cluster via adding a new edge node to configure administration or client tools.

Configure Hadoop 3.1.0 in a Multi Node Cluster

In this page, I’m going to show you how to add a edge node into the previous cluster and then install Sqoop in that node. The architecture of the cluster looks like the following diagram:

Hadoop cluster with Edge nodes

* In the edge node, you can install other client tools or administration tools beside Sqoop. This page focuses on Sqoop and please stay tuned for guidance about configuring other tools.


Apache Sqoop

The steps for configuring Sqoop in a Edge node are very similar to configure it in a single node server.

Install Apache Sqoop in Windows

Please refer the above page for more detailed steps.

SSH and firewall configuration

In my previous page about Configure Hadoop 3.1.0 in a Multi Node Cluster, it already documents every detailed information about how to setup firewall rules and SSH.

This document will only provides high-level guidance.

Step-by-step guide

Configure alias

Edit /etc/hosts in all the three servers to make sure the following configurations exist: hdp-master hdp-slave-1 hdp-edge-1

Configure firewall

Configure the internet interface of server hdp-edge-1 as ‘Internal Zone’.

Create a new system user

Use the following commands to create a new user in edge node server:

tangf@192-168-192-134:~> sudo useradd -m hadoop

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
     #2) Think before you type.
     #3) With great power comes great responsibility.

[sudo] password for root:
tangf@192-168-192-134:~> sudo passwd hadoop
New password:
BAD PASSWORD: it is based on a dictionary word
BAD PASSWORD: is too simple
Retype new password:
passwd: password updated successfully

Configure ssh

Enable ssh

Follow the same steps in the following link section to enable ssh in the edge node server:

Enable ssh

Generate ssh key

In hdp-master and hdp-slave-1 server, use the following command to copy the public key generated previously to hdp-edge-1 server:

ssh-copy-id -i ~/.ssh/hdp-key.pub hadoop@hdp-edge-1


You can verify SSH configuration by running the following command in hdp-master:

hadoop@hdp-master:/home/tangf> ssh hadoop@hdp-edge-1 -i ~/.ssh/hdp-key
Have a lot of fun...

Now, all the three servers can SSH to each other without providing the password.

Configure environment variables

Follow the following link section to configure environment variables in hdp-edge-1 (edge node server):

Configure environment variables

The commands you need to run are:

tangf@192-168-192-134:~> su hadoop
hadoop@hdp-edge-1:/home/tangf> whoami
hadoop@hdp-edge-1:/home/tangf> vi ~/.profile
hadoop@hdp-edge-1:/home/tangf> vi ~/.bashrc
hadoop@hdp-edge-1:/home/tangf> source ~/.bashrc
hadoop@hdp-edge-1:/home/tangf> source ~/.profile

Install Hadoop client in edge node

Create Hadoop folder in the edge node

Run the following commands to create hadoop folder in the edge node server (hdp-egde-1):

tangf@192-168-192-134:/> su root
hdp-edge-1:/ # whoami
hdp-edge-1:/ # mkdir /hadoop
hdp-edge-1:/ # cd /hadoop
hdp-edge-1:/hadoop # chown -R hadoop /hadoop
hdp-edge-1:/hadoop #

Copy folder

Similarly to Create folder, run the following commands in hdp-master server to copy Hadoop files to the edge node server:

su hadoop

cd /

scp -rv -i "~/.ssh/hdp-key" /hadoop  hadoop@hdp-edge-1:/

Install Sqoop in the edge node

Now, it is time to install Sqoop in the edge node server. The installation approach is similar to the following link though I’m going to use command line to download the binary package directly.

Install Apache Sqoop in Windows

Download and unzip binary package

Run the following commands in the edge node server (hdp-edge-1) to download Sqoop binary package and unzip:

su hadoop

cd /hadoop

wget http://mirror.ventraip.net.au/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Configure environment variables

Edit ~/.profile and ~/.bashrc  in the edge node server to add the following lines:

export SQOOP_HOME=/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=${PATH}:${SQOOP_HOME}/bin

The commands are:

hadoop@hdp-edge-1:/hadoop> vi ~/.profile
hadoop@hdp-edge-1:/hadoop> vi ~/.bashrc
hadoop@hdp-edge-1:/hadoop> source ~/.profile
hadoop@hdp-edge-1:/hadoop> source ~/.bashrc

Run Sqoop config command

Run the following command in the edge node server:

hadoop@hdp-edge-1:/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0> $SQOOP_HOME/bin/configure-sqoop
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.

There are warning messages because I have not installed hbase, hcatalog, accumulo or zookeeper in this edge node.

Alternative approach- Build Sqoop

If you don’t want to use the pre-built version, you can build Sqoop. To do that, you need to install Apache Ant in the server.

Download the source code using the following command:

wget http://mirror.ventraip.net.au/apache/sqoop/1.4.7/sqoop-1.4.7.tar.gz

Ant can be installed using the following commands:

hadoop@hdp-edge-1:/hadoop> wget https://www.apache.org/dist/ant/binaries/apache-ant-1.10.3-bin.tar.gz

hadoop@hdp-edge-1:/hadoop> tar -xzvf apache-ant-1.10.3-bin.tar.gz

hadoop@hdp-edge-1:/hadoop>export ANT_OPTS="-Xmx256M"

hadoop@hdp-edge-1:/hadoop>export ANT_HOME=/hadoop/apache-ant-1.10.3/

hadoop@hdp-edge-1:/hadoop>export PATH=${ANT_HOME}/bin:${JAVA_HOME}/bin:${PATH}

Once Ant is installed, run the following commands to build Sqoop:

ant package

Wait until the build is completed.

Once the build is completed, please changed the environment variables accordingly:

export SQOOP_HOME=/hadoop/sqoop-1.4.7
export PATH=${PATH}:${SQOOP_HOME}/bin

Verify Sqoop in edge node

Verify through command

Run the following command in the edge server to verify Sqoop:

hadoop@hdp-edge-1:/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0> sqoop version
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2018-05-13 01:17:27,372 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

Use Sqoop to import data from RDBMS to HDFS

Create the script file

To very Sqoop, I am going to run the following command:

The command is extracted from Load Data into HDFS from SQL Server via Sqoop.

Create a shell script file named run_sqoop_example.sh with the following content:

sqoop import --connect "jdbc:sqlserver://;databaseName=Mercury_Db" \
--username zeppelin \
--password zeppelin \
--query "SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE \$CONDITIONS" \
--split-by BlogId \
--columns "BlogId,SubTitle,Title,UniqueName" \
--num-mappers 2 \
--map-column-java "BlogId=Integer,SubTitle=String,Title=String,UniqueName=String" \
--target-dir "sqoop_test/blogs" \
--delete-target-dir  \

The commands to create this file:

hadoop@hdp-edge-1:/hadoop> touch run_sqoop_example.sh

hadoop@hdp-edge-1:/hadoop> vi run_sqoop_example.sh

hadoop@hdp-edge-1:/hadoop> chmod +x run_sqoop_example.sh

Run the script file

Run the script file to load data from SQL Server to HDFS.

hadoop@hdp-edge-1:/hadoop> ./run_sqoop_example.sh

While it is executing, you can monitor the job through YARN web UI (the following screenshots are just samples):



The standard output from the above command line:

hadoop@hdp-edge-1:/hadoop> ./run_sqoop_example.sh
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hbase does not exist! HBase imports will fail.                                                                                                                                                                              
Please set $HBASE_HOME to the root of your HBase installation.                                                                                                                                                                                                                
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../hcatalog does not exist! HCatalog jobs will fail.                                                                                                                                                                           
Please set $HCAT_HOME to the root of your HCatalog installation.                                                                                                                                                                                                              
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../accumulo does not exist! Accumulo imports will fail.                                                                                                                                                                        
Please set $ACCUMULO_HOME to the root of your Accumulo installation.                                                                                                                                                                                                          
Warning: /hadoop/sqoop-1.4.7.bin__hadoop-2.6.0/../zookeeper does not exist! Accumulo imports will fail.                                                                                                                                                                       
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.                                                                                                                                                                                                        
2018-05-13 15:31:15,055 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7                                                                                                                                                                                                        
2018-05-13 15:31:15,087 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.                                                                                                                                            
2018-05-13 15:31:15,184 INFO manager.SqlManager: Using default fetchSize of 1000                                                                                                                                                                                              
2018-05-13 15:31:15,184 INFO tool.CodeGenTool: Beginning code generation                                                                                                                                                                                                      
2018-05-13 15:31:15,565 INFO manager.SqlManager: Executing SQL statement: SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE  (1 = 0)                                                                                                            
2018-05-13 15:31:15,665 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop/hadoop-3.1.0                                                                                                                                                                               
Note: /tmp/sqoop-hadoop/compile/b954d25164a27a937bfddc85ae230d67/QueryResult.java uses or overrides a deprecated API.                                                                                                                                                         
Note: Recompile with -Xlint:deprecation for details.                                                                                                                                                                                                                          
2018-05-13 15:31:20,950 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/b954d25164a27a937bfddc85ae230d67/QueryResult.jar                                                                                                                             
2018-05-13 15:31:22,396 INFO tool.ImportTool: Destination directory sqoop_test/blogs deleted.                                                                                                                                                                                 
2018-05-13 15:31:22,398 INFO mapreduce.ImportJobBase: Beginning query import.                                                                                                                                                                                                 
2018-05-13 15:31:22,399 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address                                                                                                                                           
2018-05-13 15:31:22,430 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar                                                                                                                                                              
2018-05-13 15:31:22,527 INFO manager.SqlManager: Executing SQL statement: SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE  (1 = 0)                                                                                                            
2018-05-13 15:31:22,529 INFO manager.SqlManager: Executing SQL statement: SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE  (1 = 0)                                                                                                            
2018-05-13 15:31:22,748 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader                                                                                                                                                                   
2018-05-13 15:31:23,237 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2018-05-13 15:31:23,464 INFO client.RMProxy: Connecting to ResourceManager at hdp-master/
2018-05-13 15:31:24,165 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1526188414553_0003
2018-05-13 15:31:27,445 INFO db.DBInputFormat: Using read commited transaction isolation
2018-05-13 15:31:27,445 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(BlogId), MAX(BlogId) FROM (SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE  (1 = 1) ) AS t1
2018-05-13 15:31:27,455 INFO db.IntegerSplitter: Split size: 2; Num splits: 2 from: 100 to: 104
2018-05-13 15:31:27,572 INFO mapreduce.JobSubmitter: number of splits:2
2018-05-13 15:31:27,795 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
2018-05-13 15:31:27,796 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-05-13 15:31:28,218 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1526188414553_0003
2018-05-13 15:31:28,220 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-05-13 15:31:28,675 INFO conf.Configuration: resource-types.xml not found
2018-05-13 15:31:28,675 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-05-13 15:31:28,811 INFO impl.YarnClientImpl: Submitted application application_1526188414553_0003
2018-05-13 15:31:28,852 INFO mapreduce.Job: The url to track the job: http://hdp-master:8088/proxy/application_1526188414553_0003/
2018-05-13 15:31:28,853 INFO mapreduce.Job: Running job: job_1526188414553_0003
2018-05-13 15:31:34,973 INFO mapreduce.Job: Job job_1526188414553_0003 running in uber mode : false
2018-05-13 15:31:34,993 INFO mapreduce.Job:  map 0% reduce 0%
2018-05-13 15:31:43,087 INFO mapreduce.Job:  map 50% reduce 0%
2018-05-13 15:31:51,153 INFO mapreduce.Job:  map 100% reduce 0%
2018-05-13 15:31:52,176 INFO mapreduce.Job: Job job_1526188414553_0003 completed successfully
2018-05-13 15:31:52,325 INFO mapreduce.Job: Counters: 32
         File System Counters
                 FILE: Number of bytes read=0
                 FILE: Number of bytes written=448572
                 FILE: Number of read operations=0
                 FILE: Number of large read operations=0
                 FILE: Number of write operations=0
                 HDFS: Number of bytes read=14859
                 HDFS: Number of bytes written=7344
                 HDFS: Number of read operations=100
                 HDFS: Number of large read operations=0
                 HDFS: Number of write operations=20
         Job Counters
                 Launched map tasks=2
                 Other local map tasks=2
                 Total time spent by all maps in occupied slots (ms)=22902
                 Total time spent by all reduces in occupied slots (ms)=0
                 Total time spent by all map tasks (ms)=11451
                 Total vcore-milliseconds taken by all map tasks=11451
                 Total megabyte-milliseconds taken by all map tasks=12046452
         Map-Reduce Framework
                 Map input records=5
                 Map output records=5
                 Input split bytes=213
                 Spilled Records=0
                 Failed Shuffles=0
                 Merged Map outputs=0
                 GC time elapsed (ms)=254
                 CPU time spent (ms)=2490
                 Physical memory (bytes) snapshot=313700352
                 Virtual memory (bytes) snapshot=5118722048
                 Total committed heap usage (bytes)=65142784
                 Peak Map Physical memory (bytes)=165302272
                 Peak Map Virtual memory (bytes)=2559361024
         File Input Format Counters
                 Bytes Read=0
         File Output Format Counters
                 Bytes Written=0
2018-05-13 15:31:52,331 INFO mapreduce.ImportJobBase: Transferred 7.1719 KB in 29.0861 seconds (252.4916 bytes/sec)
2018-05-13 15:31:52,335 INFO mapreduce.ImportJobBase: Retrieved 5 records.

Verify in HDFS

Run the following commands in any of these three servers to verify:

hadoop@hdp-edge-1:~> hdfs dfs -ls /user/hadoop/sqoop_test/blogs
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2018-05-13 15:31 /user/hadoop/sqoop_test/blogs/.metadata
drwxr-xr-x   - hadoop supergroup          0 2018-05-13 15:31 /user/hadoop/sqoop_test/blogs/.signals
-rw-r--r--   2 hadoop supergroup       1750 2018-05-13 15:31 /user/hadoop/sqoop_test/blogs/7249d8a1-4464-4303-bfdd-9258a992be0a.parquet
-rw-r--r--   2 hadoop supergroup       2182 2018-05-13 15:31 /user/hadoop/sqoop_test/blogs/a6d83f56-5f46-45a7-9d74-5fced4d6a0a5.parquet

As we can see there are two parquet files created (since we are using two map tasks when running the Sqoop command line.)

Error debugs

Sqoop requires a JDK that can compile Java code.

If you get the following error when running the above sample script, you need to install JDK in your server.

2018-05-13 11:25:47,110 ERROR orm.CompilationManager: It seems as though you are running sqoop with a JRE.
2018-05-13 11:25:47,110 ERROR orm.CompilationManager: Sqoop requires a JDK that can compile Java code.
2018-05-13 11:25:47,110 ERROR orm.CompilationManager: Please install a JDK and set $JAVA_HOME to use it.
2018-05-13 11:25:47,111 ERROR tool.ImportTool: Import failed: java.io.IOException: Could not start Java compiler.
         at org.apache.sqoop.orm.CompilationManager.compile(CompilationManager.java:196)
         at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
         at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:501)
         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
         at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
         at org.apache.sqoop.Sqoop.main(Sqoop.java:252)

To install JDK, download from the following page:

And then upload the file into the /hadoop folder of the edge server.

Run the following commands to install:

hadoop@hdp-edge-1:/hadoop> tar -xvzf jdk-8u171-linux-x64.tar.gz

hadoop@hdp-edge-1:/hadoop/jdk1.8.0_171> which java
hadoop@hdp-edge-1:/hadoop/jdk1.8.0_171> which javaws
hadoop@hdp-edge-1:/hadoop/jdk1.8.0_171> sudo chmod -x /usr/bin/java*
[sudo] password for root:

hadoop@hdp-edge-1:/hadoop/jdk1.8.0_171> cd ..

hadoop@hdp-edge-1:/hadoop> sudo mv jdk1.8.0_171 /opt/

hadoop@hdp-edge-1:/opt> sudo ln -s jdk1.8.0_171 java

hadoop@hdp-edge-1:/opt> export PATH=$PATH:/opt/java/bin

hadoop@hdp-edge-1:/opt> which java
hadoop@hdp-edge-1:/opt> java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

hadoop@hdp-edge-1:/hadoop> export JAVA_HOME=/opt/java

To make JAVA_HOME environment permanent, please ~/.profile and ~/.bashrc  in the edge node server to add the following lines:

export JAVA_HOME=/opt/java
export PATH=${PATH}:${JAVA_HOME}/bin

Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

Make sure the following settings exists in mapred-site.xml:


The path value is the Hadoop distribution path.

Container is running beyond memory limits

To resolve this issue, you need configure the memory allocation related parameters in YARN (yarn-site.xml).

For example, the following YARN configuration is based on the environment I am using:


If you want to understand more about these resource configurations, please refer to Configure YARN and MapReduce Resources in Hadoop Cluster .

Error: java.lang.RuntimeException: java.lang.RuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host…

You may encounter the following error if your node servers have no access to the RSBMS server specified:

Error: java.lang.RuntimeException: java.lang.RuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host, port 1433 has failed. Error: "Connection timed out: no further information. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".
         at org.apache.sqoop.mapreduce.db.DBInputFormat.setDbConf(DBInputFormat.java:170)
         at org.apache.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:161)
         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:77)
         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:422)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

Because all the map and reduce tasks run in the node servers, you need to ensure the firewall for the corresponding IP and port is open.

For my case, I just need to configure one inbound firewall rule in my SQL Server machine:



Sqoop means SQL t Hadoop. Once it is installed in the edge node server, we can use it to import data from RDBMS to HDFS, HIVE and etc., vice versa.

Sqoop is using MapReduce to submit these jobs to YARN and then YARN allocate resources to execute the application. This kind of suggests that Sqoop will utilize the resources in your cluster.

Have a lot of fun with Sqoop.

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts