This site uses cookies to deliver our services. By using this site, you acknowledge that you have read and understand our Cookie and Privacy policy. Your use of Kontext website is subject to this policy. Allow Cookies and Dismiss

Load Data into HDFS from SQL Server via Sqoop

578 views 0 comments last modified about 7 months ago Raymond Tang

SQL Server spark hdfs parquet sqoop

This page shows how to import data from SQL Server into Hadoop via Apache Sqoop.

Prerequisites

Please follow the link below to install Sqoop in your machine if you don’t have one environment ready.

Install Apache Sqoop in Windows

Sqoop CLI

Available commands

Use the following command in Command Prompt, you will be able to find out all the available commands.

%SQOOP_HOME%\bin\sqoop.cmd help

In v1.4.7, the following commands are available.

Available commands:
   codegen            Generate code to interact with database records
   create-hive-table  Import a table definition into Hive
   eval               Evaluate a SQL statement and display the results
   export             Export an HDFS directory to a database table
   help               List available commands
   import             Import a table from a database to HDFS
   import-all-tables  Import tables from a database to HDFS
   import-mainframe   Import datasets from a mainframe server to HDFS
   job                Work with saved jobs
   list-databases     List available databases on a server
   list-tables        List available tables in a database
   merge              Merge results of incremental imports
   metastore          Run a standalone Sqoop metastore
   version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

In this example, I am going to use command import.

Sqoop import command

The details of this command are available in the following documentation.

http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_literal_sqoop_import_literal

Create the script file

Purpose

For this tutorial, I am going to create a cmd script file to call this command to import data from the following SQL Server:

  • Server: localhost
  • Database: Mecury_Db
  • Login: zeppelin
  • password: zeppelin
  • Select command:

SELECT [BlogId]

, [SubTitle]

, [Title]

, [UniqueName]

FROM [Mercury_Db].[context].[Blogs]

The above query should return the following result:

image

Script file

Create a script file sqoop-test.cmd with the following content:

%SQOOP_HOME%\bin\sqoop.cmd import --connect "jdbc:sqlserver://localhost;databaseName=Mercury_Db" ^
--username zeppelin ^
--password zeppelin ^
--query "SELECT [BlogId], [SubTitle], [Title], [UniqueName] FROM [context].[Blogs] WHERE $CONDITIONS" ^
--split-by BlogId ^
--columns "BlogId,SubTitle,Title,UniqueName" ^
--num-mappers 2 ^
--map-column-java "BlogId=Integer,SubTitle=String,Title=String,UniqueName=String" ^
--target-dir "sqoop_test/blogs" ^
--delete-target-dir  ^
--as-parquetfile

In the above script, Sqoop will use user name zeppelin and password zeppelin to connect the localhost server with default database set as Mercury_Db.

The command use query instead of table parameter. The driver will initiate two mappers to extract data which is split by column BlogId.

The data will be kept in sqoop_test/blogs folder of the running user (proxy user). The file format is parquet.  If the folder exists in HDFS already, it will be deleted first.

Run the script file

Before you run the script file, make sure Hadoop dfs and yarn daemons are started.

The following is a running result screen for this job:

image

You can also track the job status in YRAN:

image

As you can see, it is a Map Reduce application.

Verify result

We can verify the result through name node web UI or CLI.

image

As we can see, the file is imported as Parquet format.

Why there are two files named *.parquet?

Think about why there are two *.parquet files.

Change the value of parameter --num-mappers in the script file to 3 and then run the script again.

Check again to find out how many *.parquet files have been created.

Change the script file to write output as text file and you will find the results are split into three files.

image

Verify result in Zeppelin

Use the following code in Zeppelin to read the created parquet file:

%spark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import spark.implicits._

// Read parquet
val sqlContext = new SQLContext(sc)
val df=sqlContext.read.parquet("hdfs://0.0.0.0:19000/user/fahao/sqoop_test/blogs")
z.show(df)

image

If you import file as text files, you can use the following code:

%spark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import spark.implicits._

// Read parquet
val sqlContext = new SQLContext(sc)
// val df=sqlContext.read.parquet("hdfs://0.0.0.0:19000/user/fahao/sqoop_test/blogs")
val df=sqlContext.read.format("csv").load("hdfs://0.0.0.0:19000/user/fahao/sqoop_test/blogs").toDF()
z.show(df)

The above code read the input file using format CSV and then convert RDD to data frame.

Summary

It is very convenient to use Sqoop to import data from SQL Server into HDFS and vice versa. You can also use Sqoop to import data as other formats, i.e. plain text file, Avro Data Files and SequenceFiles.

In this particular example, Sqoop generate a Java class in temporary space, which is then used by Map Reduce job (there is no ‘reduce’ required for this scenario).

public class QueryResult extends SqoopRecord  implements DBWritable, Writable {

}

The application is then submitted to YARN to execute.

In fact, you can also write your own MapReduce application to implement the same function if you don’t want to use Sqoop.

Related pages

Install Hadoop 3.0.0 in Windows (Single Node)

7022 views   14 comments last modified about 9 months ago

This page summarizes the steps to install Hadoop 3.0.0 in your Windows environment. Reference page: https://wiki.apache.org/hadoop/Hadoop2OnWindows ...

View detail

Write and Read Parquet Files in Spark/Scala

3441 views   2 comments last modified about 9 months ago

In this page, I’m going to demonstrate how to write and read parquet files in Spark/Scala by using Spark SQLContext class. Reference What is parquet format? Go the following project site to understand more about parquet. ...

View detail

Resolve Hadoop RemoteException - Name node is in safe mode

120 views   0 comments last modified about 7 months ago

In Safe Mode, the HDFS cluster is read-only. After completion of block replication maintenance activity, the name node leaves safe mode automatically. If you try to delete files in safe mode, the following exception may raise: org.apache.hadoop.ipc.RemoteException(org.apac...

View detail

Configure Sqoop in a Edge Node of Hadoop Cluster

622 views   0 comments last modified about 7 months ago

This page continues with the following documentation about configuring a Hadoop multi-nodes cluster via adding a new edge node to configure administration or client tools. ...

View detail

Configure Sqoop in a Edge Node of Hadoop Cluster

622 views   0 comments last modified about 7 months ago

This page continues with the following documentation about configuring a Hadoop multi-nodes cluster via adding a new edge node to configure administration or client tools. ...

View detail

Configure YARN and MapReduce Resources in Hadoop Cluster

322 views   0 comments last modified about 7 months ago

When configuring YARN and MapReduce in Hadoop cluster, it is very important to configure the memory and virtual processors correctly. If the configurations are incorrect, the nodes may not be able to start properly and the applications may not be able to run successfully. For example...

View detail

Add comment

Please login first to add comments.  Log in New user?  Register

Comments (0)

No comments yet.