Spark Read from SQL Server Source using Windows/Kerberos Authentication


In this article, I am going to show you how to use JDBC Kerberos authentication to connect to SQL Server sources in Spark (PySpark). I will use a Kerberos connection with the principal name and password specified directly, which requires Microsoft JDBC Driver 6.2 or above. The sample code can run on Windows, Linux and macOS platforms.

If you want to work with a lower version of the Microsoft JDBC driver, I have already explained the different approaches to authenticate with a SQL Server database in detail in my previous articles. Refer to the following links for more information:

Connect to SQL Server in Spark (PySpark) - Kontext

JDBC Integrated Security, NTLM and Kerberos Authentication for SQL Server


Prerequisites

Microsoft JDBC Driver 6.2+

Download the JDBC driver (6.2 or above) from the following location:

Download Microsoft JDBC Driver for SQL Server

I use JDBC driver 7.4.1 with JRE8. Once downloaded, run the exe installer or extract the zip file to a directory.

Spark 2.x

I'm using the SparkSession object in the following sample code, which is only available since Spark 2.x.

Sample code

We just need to ensure the JDBC URL is in the following format:

jdbc:sqlserver://server_name:port;integratedSecurity=true;authenticationScheme=JavaKerberos;userName=user@REALM;password=****
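For example, here is a minimal sketch of such a URL built as a Python string; the server address, realm and password below are placeholders, not real values:

# Hypothetical Kerberos JDBC URL with the principal name and password
# embedded directly in the URL; replace all values with your own.
jdbc_url = ("jdbc:sqlserver://MSSQL.GLOBAL.KONTEXT.TECH:1433;"
            "integratedSecurity=true;authenticationScheme=JavaKerberos;"
            "userName=raymond@GLOBAL.KONTEXT.TECH;password=****")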
Alternatively, we can specify userName and password through the SparkSession.read API options, which is the approach the complete sample below takes.

Complete sample

Please create a PySpark script file pyspark-sql-jdbc-kerberos.py with the following content:

from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \
    .getOrCreate()

# Change this to your LANID in your organization domain
user = "raymond"
# Change this to your organization domain
domain = "GLOBAL.KONTEXT.TECH"
# Change to your LANID password
pwd = "password"

# Change to your own JDBC URL accordingly, especially the server address and port.
# Build the URL as a single line; embedded newlines in the connection string can cause issues.
jdbc_url = ("jdbc:sqlserver://MSSQL.GLOBAL.KONTEXT.TECH:1433;"
            "integratedSecurity=true;authenticationScheme=JavaKerberos;")

# Change to your own SQL select statement
sql = """
SELECT * FROM sys.tables
"""

# Now read data
df = spark.read \
    .format('jdbc') \
    .option('driver','com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url',jdbc_url) \
    .option('dbtable', '({sql}) AS src'.format(sql=sql)) \
    .option('user','{user}@{domain}'.format(user=user,domain=domain)) \
    .option('password', pwd) \
    .load()

df.show()

Remember to change the variables to match your own environment.
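
As a side note: since Spark 2.4, the JDBC source also accepts a query option, so the SELECT statement can be passed directly instead of being wrapped as a derived table in dbtable. A minimal sketch reusing the variables from the sample above:

# Spark 2.4+ only: pass the SELECT statement directly via the 'query' option
# instead of aliasing it as a derived table in 'dbtable'.
df = spark.read \
    .format('jdbc') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url', jdbc_url) \
    .option('query', sql) \
    .option('user', '{user}@{domain}'.format(user=user, domain=domain)) \
    .option('password', pwd) \
    .load()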

Run the code

Before we run the code, we need to ensure the JDBC driver jar is passed to Spark. Use the following command to run the application:
spark-submit --jars sqljdbc_7.4\enu\mssql-jdbc-7.4.1.jre8.jar pyspark-sql-jdbc-kerberos.py
Make sure the jar file path matches the JDBC driver location in your environment. There can be multiple jar files for different JRE versions; in the above command, I am using the version for JRE8.
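
Alternatively, the driver jar can be attached through the spark.jars configuration when building the session, which avoids passing --jars on the command line. A minimal sketch, assuming the jar sits at the same relative path as in the command above (forward slashes also work on Windows):

from pyspark.sql import SparkSession

# Assumption: adjust the jar path to your own JDBC driver location.
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \
    .config('spark.jars', 'sqljdbc_7.4/enu/mssql-jdbc-7.4.1.jre8.jar') \
    .getOrCreate()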

Reference

Refer to the following page for more information regarding Kerberos integrated authentication:

Using Kerberos integrated authentication to connect to SQL Server

Let me know if you have any questions. 
