Raymond

Load Data from Teradata in Spark (PySpark)

2019-07-06

In my article Connect to Teradata database through Python, I demonstrated how to use the Teradata Python package or the Teradata ODBC driver to connect to Teradata. In this article, I'm going to show you how to connect to Teradata through JDBC drivers so that you can load data directly into PySpark DataFrames.

PySpark code

Create a PySpark script file named teradata-jdbc.py with the following code:

from pyspark.sql import SparkSession

appName = "PySpark Teradata Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

driver = 'com.teradata.jdbc.TeraDriver'

# Define the function to load data from Teradata


def load_data(driver, jdbc_url, sql, user, password):
    return spark.read \
        .format('jdbc') \
        .option('driver', driver) \
        .option('url', jdbc_url) \
        .option('dbtable', '({sql}) as src'.format(sql=sql)) \
        .option('user', user) \
        .option('password', password) \
        .load()

sql = "select * from mydb.mytable"
url = "jdbc:teradata://myserver/Database=mydb,LOGMECH=LDAP"
user = "dbc"
password = "dbc"

df_td = load_data(driver, url, sql, user, password)
df_td.show(10)

Some details about the code snippets

In the above example, the JDBC connection string is configured to use LDAP as the login mechanism. You can also change it to TD2 so that you can connect with a Teradata database username and password.
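
For example, switching to TD2 only changes the connection string; the server and database names below are placeholders:

url = "jdbc:teradata://myserver/Database=mydb,LOGMECH=TD2"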

Depending on your Spark version, you may be able to use the query option to pass in your SQL query directly instead of dbtable. The query and dbtable options cannot be specified at the same time. In lower versions of Spark, pass your SQL in as a subquery, as in the example above.
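
As a minimal sketch, assuming Spark 2.4 or later where the query option is available, the function could be written like this (it reuses the spark session created in the script above):

# Variant of load_data using the query option instead of dbtable.
# Note: query does not need the "(...) as src" subquery wrapping.
def load_data_query(driver, jdbc_url, sql, user, password):
    return spark.read \
        .format('jdbc') \
        .option('driver', driver) \
        .option('url', jdbc_url) \
        .option('query', sql) \
        .option('user', user) \
        .option('password', password) \
        .load()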

Run the code

Now you can run the code with the following command:

spark2-submit --jars 'your/path/to/teradata/jdbc/drivers/*' teradata-jdbc.py

You need to specify the JARs for the Teradata JDBC drivers if you have not already done so in your Spark configuration. Two JARs are required:

  • tdgssconfig.jar
  • terajdbc4.jar
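
Since --jars expects a comma-separated list of paths, you can also list the two JARs explicitly rather than relying on a wildcard; the paths below are placeholders:

spark2-submit --jars /path/to/tdgssconfig.jar,/path/to/terajdbc4.jar teradata-jdbc.py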

You can also use a different version of the Teradata JDBC drivers.

Run the code in a cluster

If you are going to run the code in a cluster or through workflow tools like Oozie, you can copy these JAR files into HDFS and then pass the library path or jars arguments as HDFS file paths. In this way, all the workloads can load the Teradata JDBC drivers successfully.
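
For example, assuming the JARs have been uploaded to an HDFS directory (the path below is a placeholder), the submit command on YARN would look like this:

spark2-submit --master yarn --jars hdfs:///libs/tdgssconfig.jar,hdfs:///libs/terajdbc4.jar teradata-jdbc.py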

SQL Server, Oracle, MySQL…

You can also use a similar approach to connect to SQL Server, Oracle, MySQL or any other database that provides a JDBC driver.
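
As a sketch, the same load_data function works for MySQL once you swap in the MySQL driver class and URL; the host, database and credentials below are placeholders, and the MySQL Connector/J JAR must be passed to Spark the same way as the Teradata JARs above:

# Reuse load_data with a MySQL driver and connection string
driver = 'com.mysql.cj.jdbc.Driver'
url = 'jdbc:mysql://myhost:3306/mydb'
df_mysql = load_data(driver, url, 'select * from mytable', user, password)
df_mysql.show(10)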

Have fun!

Comments
venu gopal #1500 · 3 years ago

Only found this file 'terajdbc4.jar'.

Installed PySpark in a Jupyter notebook and set the class path in an environment variable, but still facing this issue: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

Raymond #1501 · 3 years ago

For the latest Teradata JDBC driver, there is only one JAR file required, while earlier versions had two JAR files.

If you hit that error in Jupyter, it means you have not added the Teradata JDBC driver path to the classpath:

spark = SparkSession \
    .builder \
    .appName("Spark App") \
    .config("spark.jars", "/path/to/teradata/jdbc.jar,/path/to/another/jar.jar") \
    .getOrCreate()
