Spark Read from SQL Server Source using Windows/Kerberos Authentication

event 2020-02-03 visibility 6,509 comment 6 insights
more_vert
insights Stats
Spark Read from SQL Server Source using Windows/Kerberos Authentication
Raymond Raymond Spark & PySpark

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.


In this article, I am going to show you how to use JDBC Kerberos authentication to connect to SQL Server sources in Spark (PySpark). I will use Kerberos connection with principal names and password directly that requires Microsoft JDBC Driver 6.2 or above. The sample code can run on Windows, Linux and Mac-OS platforms. 

If you want to work with lower version of Microsft JDBC driver, I already explained in details about the different approaches to authenticate with SQL Server database in my previous articles.  Refer to the following links for more information:

Connect to SQL Server in Spark (PySpark) - Kontext

JDBC Integrated Security, NTLM and Kerberos Authentication for SQL Server


Prerequisites

Microsoft JDBC Driver 6.2 +

Download JDBC driver (6.2. or above) from the following location:

Download Microsoft JDBC Driver for SQL Server

I use JDBC driver 7.4.1 with JRE8. Once downloaded, install the exe or extract the zip file to a directory. 

Spark 2.x

I'm using SparkSession object in the following sample code which is only available since Spark 2.x.

Sample code

We just need to ensure JDBC URL is in the following format:

jdbc:sqlserver://servername=server_name;integratedSecurity=true;authenticationScheme=JavaKerberos;userName=user@REALM;password=****
For userName and password we can also specify it in SparkSession.read APIs.

Complete sample

Please create a PySpark script file pypark-sql-jdbc-kerberos.py with the following content:

from pyspark.sql import SparkSession

# Create spark session
spark = SparkSession.builder \
    .appName('Spark - SQL Server Kerberos Integrated Authentication Example') \
    .master('local') \
    .getOrCreate()

# Change this to your LANID in your organization domain
user = "raymond"
# Change this to your ogranization domain
domain = "GLOBAL.KONTEXT.TECH"
# Change to your LANID password
pwd = "password"

# Change to your own JDBC URL accordingly especially the server address and port
jdbc_url = """jdbc:sqlserver://MSSQL.GLOBAL.KONTEXT.TECH:1433;
integratedSecurity=true;authenticationScheme=JavaKerberos;
"""

# Change to your own SQL select statement
sql = """
SELECT * FROM sys.tables
"""

# Now read data
df = spark.read \
    .format('jdbc') \
    .option('driver','com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url',jdbc_url) \
    .option('dbtable', '({sql} as src)'.format(sql=sql)) \
    .option('user','{user}@{domain}'.format(user=user,domain=domain)) \
    .option('password', pwd) \
    .load()

df.show()

Remember change the variables accordingly to match with your own environment.

Run the code

Before we run the code, we need to ensure JDBC driver is passed in accordingly. Use the following command to run the application:
spark-submit --jars sqljdbc_7.4\enu\mssql-jdbc-7.4.1.jre8.jar pypark-sql-jdbc-kerberos.py
Make sure jar file path matches the JDBC driver file location of your environment. There can be multiple jar files for different JRE. In the above command, I am using the version for JRE8.

Reference

Refer to the following page about more information regards to Kerberos integrated authentication: 

Using Kerberos integrated authentication to connect to SQL Server

Let me know if you have any questions. 

More from Kontext
comment Comments
Raymond Raymond #1464 access_time 4 years ago more_vert
Sorry for the late reply. I've been busy recently and I have not got an environment with Spark and AD integrated. I will update you once I have that configured. 
format_quote

person venkatesan access_time 4 years ago

We are using Spark 2.x. but the keytab was included in spark3.x could you please share the article to connect with SQL from the kerberos enabled spark cluster using kerberos authentication..

V venkatesan G.K #461 access_time 4 years ago more_vert

We are using Spark 2.x. but the keytab was included in spark3.x could you please share the article to connect with SQL from the kerberos enabled spark cluster using kerberos authentication..

format_quote

person Raymond access_time 4 years ago

For Kerberos enabled Spark cluster, it is usually used to authenticate with other Hadoop services like HDFS, Hive, HBase, etc. Access tokens will be retrieved from those services to be used in Spark application. There might be a way to use built-in functions to reuse but I am not familiar with those details.

To use Kerberos authentication to read data from SQL Server via keytab, you can pass in the keytab and principal parameters:

  • keytab Location of the kerberos keytab file (which must be pre-uploaded to all nodes either by --files option of spark-submit or manually) for the JDBC client. When path information found then Spark considers the keytab distributed manually, otherwise --files assumed. If both keytab and principal are defined then Spark tries to do kerberos authentication.
  • principal Specifies kerberos principal name for the JDBC client. If both keytab and principal are defined then Spark tries to do kerberos authentication.

Please follow this article find out more details: JDBC To Other Databases - Spark 3.1.1 Documentation (apache.org).

If you still could not work out the solution, I can find time to write a dedicated article for this.

Raymond Raymond #460 access_time 4 years ago more_vert

For Kerberos enabled Spark cluster, it is usually used to authenticate with other Hadoop services like HDFS, Hive, HBase, etc. Access tokens will be retrieved from those services to be used in Spark application. There might be a way to use built-in functions to reuse but I am not familiar with those details.

To use Kerberos authentication to read data from SQL Server via keytab, you can pass in the keytab and principal parameters:

  • keytab Location of the kerberos keytab file (which must be pre-uploaded to all nodes either by --files option of spark-submit or manually) for the JDBC client. When path information found then Spark considers the keytab distributed manually, otherwise --files assumed. If both keytab and principal are defined then Spark tries to do kerberos authentication.
  • principal Specifies kerberos principal name for the JDBC client. If both keytab and principal are defined then Spark tries to do kerberos authentication.

Please follow this article find out more details: JDBC To Other Databases - Spark 3.1.1 Documentation (apache.org).

If you still could not work out the solution, I can find time to write a dedicated article for this.

format_quote

person venkatesan access_time 4 years ago

How this will work for kerberos enabled spark cluster.

Did you implement the ticket cache creation in python...? Please share it for reference.


V venkatesan G.K #459 access_time 4 years ago more_vert

How this will work for kerberos enabled spark cluster.

Did you implement the ticket cache creation in python...? Please share it for reference.


format_quote

person Raymond access_time 4 years ago

Hello,

The complete code is already provided here: Spark Read from SQL Server Source using Windows/Kerberos Authentication

The example code uses latest SQL Server JDBC driver which doesn't require keytab. Refer to the following article about how to generate Kerberos ticket using keytab (it also shows an example of generating that using Java programmatically):

Java Kerberos Authentication Configuration Sample & SQL Server Connection Practice



Raymond Raymond #458 access_time 4 years ago more_vert

Hello,

The complete code is already provided here: Spark Read from SQL Server Source using Windows/Kerberos Authentication

The example code uses latest SQL Server JDBC driver which doesn't require keytab. Refer to the following article about how to generate Kerberos ticket using keytab (it also shows an example of generating that using Java programmatically):

Java Kerberos Authentication Configuration Sample & SQL Server Connection Practice



format_quote

person venkatesan access_time 4 years ago

Can you let me know how we set keytab location in the script.

Can you share the completed code..


V venkatesan G.K #457 access_time 4 years ago more_vert

Can you let me know how we set keytab location in the script.

Can you share the completed code..


Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts