PySpark - Read Data from Oracle Database

Kontext Kontext event 2022-06-18 visibility 15,751 comment 3
more_vert

Spark provides different approaches to load data from relational databases like Oracle. We can use Python APIs to read from Oracle using JayDeBeApi (JDBC), Oracle Python driver, ODBC and other supported drivers. Alternatively, we can directly use Spark DataFrameReader.read API with format 'jdbc'. 

This article provides an example of using JDBC directly in PySpark. The same approach can be applied to other relational databases like MySQL, PostgreSQL, SQL Server, etc.

Prerequisites

PySpark environment

You can install Spark on you Windows or Linux machine by following this article: Install Spark 3.2.1 on Linux or WSL. For macOS,  follow this one: Apache Spark 3.0.1 Installation on macOS.

For testing the sample script, you can also just use PySpark package directly without doing Spark configurations:

pip install pyspark

For Anaconda environment, you can also install PySpark using the following command:

conda install pyspark

Oracle JDBC package

Oracle JDBC driver ojdbc can be downloaded from Maven Central: Maven Repository: com.oracle.database.jdbc » ojdbc8 » 21.5.0.0. It works with JDK8, JDK11, JDK12, JDK13, JDK14 and JDK15. The license information can be found here.

Download this jar file (ojdbc8-21.5.0.0.jar) into your PySpark project folder. We will use it when submit Spark job:

spark-submit --jars ojdbc8-21.5.0.0.jar ...

Read from Oracle database

Now we can create a PySpark script (oracle-example.py) to load data from Oracle database as DataFrame.

# oracle-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Oracle Example"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

sql = "select * from myschema.mytable"
user = "username"
password = "userpassword"
# Change this to your Oracle's details accordingly
server = "oraclesever"
port = 1521
service_name = 'ORCL'
jdbcUrl = f"jdbc:oracle:thin:@{server}:{port}:{service_name}"
jdbcDriver = "oracle.jdbc.driver.OracleDriver"

# Create a data frame by reading data from Oracle via JDBC
jdbcDF = spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", sql) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", jdbcDriver) \
    .load()

jdbcDF.show()
Run the script with the following command:
spark-submit --jars ojdbc8-21.5.0.0.jar oracle-example.py

Argument --jars will add the jar package to both driver and executor containers. We used the thin driver thus Oracle native client is not required.

References

Save DataFrame to SQL Databases via JDBC in PySpark

Python: Read Data from Oracle Database

More from Kontext
comment Comments
NV Naga vaibhav Elluru

Naga vaibhav access_time 2 years ago link more_vert

I updated the string with a slash instead of colon. That works fine.

jdbc:oracle:thin:@{server}:{port}/{service_name}


Raymond Raymond

Raymond access_time 2 years ago link more_vert

I'm glad it works with service name. Usually that error is due to the wrong service name used in connection string or SID not configured in the Oracle server side.

NV Naga vaibhav Elluru

Naga vaibhav access_time 2 years ago link more_vert

I tried this out and spark errored out saying: ORA-12505, TNS:listener does not currently know of SID given in connect descriptor.

I have service name to provide on the connection string, how do I provide that on the connection string?

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts