Spark Scala: Load Data from Teradata


In my article Connect to Teradata database through Python, I demonstrated how to use the Teradata Python package or the Teradata ODBC driver to connect to Teradata. In this article, I'm going to show you how to connect to Teradata through JDBC drivers so that you can load data directly into Spark DataFrames, using Scala as the programming language.

The following code snippets run in spark-shell. You can also create a Scala project, compile it into a JAR file, and then use spark-submit to run the application. The code can also run in notebooks like Zeppelin.

Prerequisites

The Teradata JDBC driver is required. Please download it from the Teradata official website.

Start session with Teradata

First, start a Spark session using Spark Shell and pass in your Teradata JDBC driver through the --jars argument:

spark-shell --jars F:\lib\terajdbc4.jar -c spark.ui.port=11111

The Teradata JDBC driver is located at F:\lib\terajdbc4.jar on my system.

Load data from Teradata

Once the session is started, follow these steps to load data from Teradata.

1) Import types

import org.apache.spark.sql._
import org.apache.spark.sql.types._

2) Create DataFrame

val jdbcDF = (spark.read.format("jdbc")
.option("url", "jdbc:teradata://192.168.119.128/Database=dbc,LOGMECH=td2")
.option("dbtable", "(sel * from TestDb.test_table) as src")
.option("user", "dbc")
.option("password", "dbc") 
.option("driver", "com.teradata.jdbc.TeraDriver") 
.load()
)
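
Alternatively, the same DataFrame can be built with the spark.read.jdbc overload that takes a java.util.Properties object. The sketch below mirrors the options above and assumes the same host, credentials and table:

// Minimal alternative sketch using java.util.Properties;
// host, credentials and table are the same as in the example above.
import java.util.Properties

val connProps = new Properties()
connProps.setProperty("user", "dbc")
connProps.setProperty("password", "dbc")
connProps.setProperty("driver", "com.teradata.jdbc.TeraDriver")

val jdbcDF2 = spark.read.jdbc(
"jdbc:teradata://192.168.119.128/Database=dbc,LOGMECH=td2",
"(sel * from TestDb.test_table) as src",
connProps
)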

3) Show the result

scala> jdbcDF.show()
+---+--------+------+
| id|category|amount|
+---+--------+------+
|  5|       B|   101|
|  5|       C|  1001|
|  3|       A|    12|
|  1|       A|    10|
|  6|       B|   102|
|  6|       C|  1002|
|  4|       B|   100|
|  4|       C|  1000|
|  2|       A|    11|
+---+--------+------+

The result is consistent with directly querying the data via SQL Assistant.


Some details about the code snippets

In the above example, the JDBC connection string is configured to use TD2 as the login mechanism. You can also change it to LDAP so that you can use your Active Directory credentials to log on to the database, as sketched below.
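
For example, switching to LDAP only requires changing the LOGMECH value in the URL and supplying your directory credentials. This is a sketch, assuming LDAP is enabled on the Teradata system; the user name and password are placeholders:

// Sketch only: assumes LDAP is enabled on the Teradata system.
// Replace the placeholder user name and password with your directory credentials.
val ldapDF = (spark.read.format("jdbc")
.option("url", "jdbc:teradata://192.168.119.128/Database=dbc,LOGMECH=LDAP")
.option("dbtable", "(sel * from TestDb.test_table) as src")
.option("user", "your_ad_user")
.option("password", "your_ad_password")
.option("driver", "com.teradata.jdbc.TeraDriver")
.load()
)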

Depending on your Spark version, you may be able to use the query option to pass in your SQL query directly instead of dbtable. The query and dbtable options cannot be specified at the same time. In lower versions of Spark, you can pass your SQL in as a subquery, as I did in the example above.
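
For instance, on Spark 2.4 or later the same SQL can be passed through the query option. This sketch assumes the same connection details as the earlier example:

// Sketch: requires a Spark version that supports the JDBC "query" option (2.4+).
// Note: "query" and "dbtable" cannot be used together.
val queryDF = (spark.read.format("jdbc")
.option("url", "jdbc:teradata://192.168.119.128/Database=dbc,LOGMECH=td2")
.option("query", "sel * from TestDb.test_table")
.option("user", "dbc")
.option("password", "dbc")
.option("driver", "com.teradata.jdbc.TeraDriver")
.load()
)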

Run the code in a cluster

If you are going to run the code in a cluster or through workflow tools like Oozie, you can copy these JAR files into HDFS and then pass the library path or --jars argument as HDFS file paths. In this way, all the workloads can load the Teradata JDBC driver successfully.
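
For example, a spark-submit call could reference the driver in HDFS like the sketch below; the HDFS path, class name and application JAR name are placeholders for illustration:

spark-submit --jars hdfs:///user/libs/terajdbc4.jar --class com.example.LoadFromTeradata my-spark-app.jar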

SQL Server, Oracle, MySQL…

You can also use a similar approach to connect to SQL Server, Oracle, MySQL or any other JDBC-compliant database.
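
As a sketch, a MySQL connection would only differ in the URL and driver class. The host, database, table and credentials below are placeholders, and the MySQL Connector/J JAR would need to be passed through --jars in the same way as the Teradata driver:

// Sketch: assumes MySQL Connector/J is on the classpath (passed via --jars).
// Host, database, table and credentials are placeholders.
val mysqlDF = (spark.read.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/testdb")
.option("dbtable", "test_table")
.option("user", "root")
.option("password", "password")
.option("driver", "com.mysql.cj.jdbc.Driver")
.load()
)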

Have fun!
