PySpark - Read Data from Oracle Database
insights Stats
Apache Spark installation guides, performance tuning tips, general tutorials, etc.
*Spark logo is a registered trademark of Apache Spark.
Spark provides different approaches to load data from relational databases like Oracle. We can use Python APIs to read from Oracle using JayDeBeApi (JDBC), Oracle Python driver, ODBC and other supported drivers. Alternatively, we can directly use Spark DataFrameReader.read
API with format 'jdbc'.
This article provides an example of using JDBC directly in PySpark. The same approach can be applied to other relational databases like MySQL, PostgreSQL, SQL Server, etc.
Prerequisites
PySpark environment
You can install Spark on you Windows or Linux machine by following this article: Install Spark 3.2.1 on Linux or WSL. For macOS, follow this one: Apache Spark 3.0.1 Installation on macOS.
For testing the sample script, you can also just use PySpark package directly without doing Spark configurations:
pip install pyspark
For Anaconda environment, you can also install PySpark using the following command:
conda install pyspark
Oracle JDBC package
Oracle JDBC driver ojdbc can be downloaded from Maven Central: Maven Repository: com.oracle.database.jdbc » ojdbc8 » 21.5.0.0. It works with JDK8, JDK11, JDK12, JDK13, JDK14 and JDK15. The license information can be found here.
Download this jar file (ojdbc8-21.5.0.0.jar) into your PySpark project folder. We will use it when submit Spark job:
spark-submit --jars ojdbc8-21.5.0.0.jar ...
Read from Oracle database
Now we can create a PySpark script (oracle-example.py) to load data from Oracle database as DataFrame.
# oracle-example.py from pyspark.sql import SparkSession appName = "PySpark Example - Oracle Example" master = "local" # Create Spark session spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .getOrCreate() sql = "select * from myschema.mytable" user = "username" password = "userpassword" # Change this to your Oracle's details accordingly server = "oraclesever" port = 1521 service_name = 'ORCL' jdbcUrl = f"jdbc:oracle:thin:@{server}:{port}:{service_name}" jdbcDriver = "oracle.jdbc.driver.OracleDriver" # Create a data frame by reading data from Oracle via JDBC jdbcDF = spark.read.format("jdbc") \ .option("url", jdbcUrl) \ .option("query", sql) \ .option("user", user) \ .option("password", password) \ .option("driver", jdbcDriver) \ .load() jdbcDF.show()
spark-submit --jars ojdbc8-21.5.0.0.jar oracle-example.py
Argument --jars
will add the jar package to both driver and executor containers. We used the thin driver thus Oracle native client is not required.
References
Save DataFrame to SQL Databases via JDBC in PySpark
Python: Read Data from Oracle Database
person Naga vaibhav access_time 2 years ago
I updated the string with a slash instead of colon. That works fine.
jdbc:oracle:thin:@{server}:{port}/{service_name}
I updated the string with a slash instead of colon. That works fine.
jdbc:oracle:thin:@{server}:{port}/{service_name}
I tried this out and spark errored out saying: ORA-12505, TNS:listener does not currently know of SID given in connect descriptor.
I have service name to provide on the connection string, how do I provide that on the connection string?
I'm glad it works with service name. Usually that error is due to the wrong service name used in connection string or SID not configured in the Oracle server side.