Spark Read from SQL Server Source using Windows/Kerberos Authentication

In this article, I am going to show you how to use JDBC Kerberos authentication to connect to SQL Server sources in Spark (PySpark). I will connect using a Kerberos principal name and password directly, which requires Microsoft JDBC Driver 6.2 or above. The sample code can run on Windows, Linux and macOS.

If you want to work with a lower version of the Microsoft JDBC driver, I have already explained the different approaches to authenticate with a SQL Server database in detail in my previous articles. Refer to the following links for more information:

Connect to SQL Server in Spark (PySpark) - Kontext

JDBC Integrated Security, NTLM and Kerberos Authentication for SQL Server


Prerequisites

Microsoft JDBC Driver 6.2 +

Download the JDBC driver (6.2 or above) from the following location:

Download Microsoft JDBC Driver for SQL Server

I use JDBC driver 7.4.1 with JRE 8. Once downloaded, run the exe installer or extract the zip file to a directory. 

Spark 2.x

The following sample code uses the SparkSession object, which is only available since Spark 2.x.

Sample code

We just need to ensure the JDBC URL is in the following format:

jdbc:sqlserver://server_name:port;integratedSecurity=true;authenticationScheme=JavaKerberos;userName=user@REALM;password=****
Instead of embedding userName and password in the URL, we can also specify them as options through the SparkSession.read API.
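
For illustration, here is a minimal sketch of the option-based form using the DataFrameReader.jdbc API. The server address, table name and principal below are placeholders, not values from a real environment:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('JDBC Kerberos options example').getOrCreate()

# Placeholder server, table and principal - replace with your own values
df = spark.read.jdbc(
    url='jdbc:sqlserver://server_name:1433;integratedSecurity=true;authenticationScheme=JavaKerberos;',
    table='dbo.my_table',
    properties={
        'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
        'user': 'user@REALM',
        'password': '****'
    })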

Complete sample

Please create a PySpark script file pyspark-sql-jdbc-kerberos.py with the following content:

from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \
    .getOrCreate()

# Change this to your LANID in your organization domain
user = "raymond"
# Change this to your organization domain (Kerberos realm)
domain = "GLOBAL.KONTEXT.TECH"
# Change to your LANID password
pwd = "password"

# Change to your own JDBC URL accordingly, especially the server address and port.
# Keep the URL free of line breaks so that the driver can parse all properties.
jdbc_url = ("jdbc:sqlserver://MSSQL.GLOBAL.KONTEXT.TECH:1433;"
            "integratedSecurity=true;authenticationScheme=JavaKerberos;")

# Change to your own SQL select statement
sql = """
SELECT * FROM sys.tables
"""

# Now read data
df = spark.read \
    .format('jdbc') \
    .option('driver','com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url',jdbc_url) \
    .option('dbtable', '({sql}) as src'.format(sql=sql)) \
    .option('user','{user}@{domain}'.format(user=user,domain=domain)) \
    .option('password', pwd) \
    .load()

df.show()

Remember to change the variables above to match your own environment. Note that the SQL statement is wrapped as a derived table, so the alias ('as src') must come after the closing parenthesis; otherwise SQL Server rejects the generated query.

Run the code

Before we run the code, we need to ensure the JDBC driver jar is passed to Spark. Use the following command to run the application:
spark-submit --jars sqljdbc_7.4\enu\mssql-jdbc-7.4.1.jre8.jar pyspark-sql-jdbc-kerberos.py
Make sure the jar file path matches the JDBC driver file location in your environment. There can be multiple jar files for different JRE versions. In the above command, I am using the version for JRE 8.
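
Alternatively, the driver jar can be configured in the script itself. This is a minimal sketch where the jar path is an assumption to adjust; depending on the deploy mode this may not always place the driver on the driver classpath, so passing --jars to spark-submit remains the most reliable option:

from pyspark.sql import SparkSession

# Hypothetical jar path - change it to where you extracted the JDBC driver
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \
    .config('spark.jars', 'sqljdbc_7.4/enu/mssql-jdbc-7.4.1.jre8.jar') \
    .getOrCreate()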

Reference

Refer to the following page for more information about Kerberos integrated authentication: 

Using Kerberos integrated authentication to connect to SQL Server
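
One more note on environment setup: the JVM resolves Kerberos realms and KDCs through its own configuration, typically a krb5.conf file (krb5.ini on Windows), especially on machines that are not joined to the domain. Below is a minimal sketch of such a file; the KDC host name is an assumption and must be replaced with your own domain controller:

[libdefaults]
    default_realm = GLOBAL.KONTEXT.TECH

[realms]
    GLOBAL.KONTEXT.TECH = {
        # Hypothetical KDC host - replace with your own domain controller
        kdc = dc.global.kontext.tech
    }

You can also point the JVM at a specific file via spark-submit, for example: --driver-java-options "-Djava.security.krb5.conf=/etc/krb5.conf".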

Let me know if you have any questions. 
