PySpark - Read Data from Oracle Database

Spark provides different approaches to load data from relational databases like Oracle. We can use Python APIs to read from Oracle using JayDeBeApi (JDBC), Oracle Python driver, ODBC and other supported drivers. Alternatively, we can directly use Spark DataFrameReader.read API with format 'jdbc'.

This article provides an example of using JDBC directly in PySpark. The same approach can be applied to other relational databases like MySQL, PostgreSQL, SQL Server, etc.

Prerequisites

PySpark environment

You can install Spark on you Windows or Linux machine by following this article: Install Spark 3.2.1 on Linux or WSL. For macOS, follow this one: Apache Spark 3.0.1 Installation on macOS.

For testing the sample script, you can also just use PySpark package directly without doing Spark configurations:

pip install pyspark

For Anaconda environment, you can also install PySpark using the following command:

conda install pyspark

Oracle JDBC package

Oracle JDBC driver ojdbc can be downloaded from Maven Central: Maven Repository: com.oracle.database.jdbc » ojdbc8 » 21.5.0.0. It works with JDK8, JDK11, JDK12, JDK13, JDK14 and JDK15. The license information can be found here.

Download this jar file (ojdbc8-21.5.0.0.jar) into your PySpark project folder. We will use it when submit Spark job:

spark-submit --jars ojdbc8-21.5.0.0.jar ...

Read from Oracle database

Now we can create a PySpark script (oracle-example.py) to load data from Oracle database as DataFrame.

# oracle-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Oracle Example"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

sql = "select * from myschema.mytable"
user = "username"
password = "userpassword"
# Change this to your Oracle's details accordingly
server = "oraclesever"
port = 1521
service_name = 'ORCL'
jdbcUrl = f"jdbc:oracle:thin:@{server}:{port}:{service_name}"
jdbcDriver = "oracle.jdbc.driver.OracleDriver"

# Create a data frame by reading data from Oracle via JDBC
jdbcDF = spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", sql) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", jdbcDriver) \
    .load()

jdbcDF.show()

Run the script with the following command:

spark-submit --jars ojdbc8-21.5.0.0.jar oracle-example.py

Argument --jars will add the jar package to both driver and executor containers. We used the thin driver thus Oracle native client is not required.

References

Save DataFrame to SQL Databases via JDBC in PySpark

Python: Read Data from Oracle Database