By using this site, you acknowledge that you have read and understand our Cookie policy, Privacy policy and Terms .

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

rss_feed Subscribe RSS

In this article, I am going to show you how to use JDBC Kerberos authentication to connect to SQL Server sources in Spark (PySpark). I will use Kerberos connection with principal names and password directly that requires Microsoft JDBC Driver 6.2 or above. The sample code can run on Windows, Linux and Mac-OS platforms. 

If you want to work with lower version of Microsft JDBC driver, I already explained in details about the different approaches to authenticate with SQL Server database in my previous articles.  Refer to the following links for more information:

Connect to SQL Server in Spark (PySpark) - Kontext

JDBC Integrated Security, NTLM and Kerberos Authentication for SQL Server


Microsoft JDBC Driver 6.2 +

Download JDBC driver (6.2. or above) from the following location:

Download Microsoft JDBC Driver for SQL Server

I use JDBC driver 7.4.1 with JRE8. Once downloaded, install the exe or extract the zip file to a directory. 

Spark 2.x

I'm using SparkSession object in the following sample code which is only available since Spark 2.x.

Sample code

We just need to ensure JDBC URL is in the following format:

For userName and password we can also specify it in APIs.

Complete sample

Please create a PySpark script file with the following content:

from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \

# Change this to your LANID in your organization domain
user = "raymond"
# Change this to your ogranization domain
# Change to your LANID password
pwd = "password"

# Change to your own JDBC URL accordingly especially the server address and port
jdbc_url = """jdbc:sqlserver://MSSQL.GLOBAL.KONTEXT.TECH:1433;

# Change to your own SQL select statement
sql = """
SELECT * FROM sys.tables

# Now read data
df = \
    .format('jdbc') \
    .option('driver','') \
    .option('url',jdbc_url) \
    .option('dbtable', '({sql} as src)'.format(sql=sql)) \
    .option('user','{user}@{domain}'.format(user=user,domain=domain)) \
    .option('password', pwd) \

Remember change the variables accordingly to match with your own environment.

Run the code

Before we run the code, we need to ensure JDBC driver is passed in accordingly. Use the following command to run the application:
spark-submit --jars sqljdbc_7.4\enu\mssql-jdbc-7.4.1.jre8.jar
Make sure jar file path matches the JDBC driver file location of your environment. There can be multiple jar files for different JRE. In the above command, I am using the version for JRE8.


Refer to the following page about more information regards to Kerberos integrated authentication: 

Using Kerberos integrated authentication to connect to SQL Server

Let me know if you have any questions. 

info Last modified by Raymond at 12 days ago

info About author

info License/Terms

More from Kontext

Schema Merging (Evolution) with Parquet in Spark and Hive

local_offer parquet local_offer pyspark local_offer spark-2-x local_offer hive local_offer hdfs

visibility 27
thumb_up 0
access_time 15 days ago

Schema evolution is supported by many frameworks or data serialization systems such as Avro, Orc, Protocol Buffer and Parquet. With schema evolution, one set of data can be stored in multiple files with different but compatible schema. In Spark, Parquet data source can detect and merge schema ...

open_in_new View

local_offer pyspark local_offer spark-2-x local_offer python

visibility 83
thumb_up 0
access_time 2 months ago

This articles show you how to convert a Python dictionary list to a Spark DataFrame. The code snippets runs on Spark 2.x environments. Input The input data (dictionary list looks like the following): data = [{"Category": 'Category A', 'ItemID': 1, 'Amount': 12.40}, ...

open_in_new View

Improve PySpark Performance using Pandas UDF with Apache Arrow

local_offer pyspark local_offer spark local_offer spark-2-x local_offer pandas

visibility 162
thumb_up 4
access_time 2 months ago

Apache Arrow is an in-memory columnar data format that can be used in Spark to efficiently transfer data between JVM and Python processes. This currently is most beneficial to Python users that work with Pandas/NumPy data. In this article, ...

open_in_new View

local_offer pyspark local_offer spark-2-x local_offer spark

visibility 21
thumb_up 0
access_time 2 months ago

This article shows you how to read and write XML files in Spark. Sample XML file Create a sample XML file named test.xml with the following content: <?xml version="1.0"?> <data> <record id="1"> <rid>1</rid> <nam...

open_in_new View

info Follow us on Twitter to get the latest article updates. Follow us