Add JARs to a Spark Job

Raymond Tang 2/28/2021

Spark applications can reference Java libraries. Once the application is built, the spark-submit command is used to submit it to run in a Spark environment.

Use --jars option

To add JARs to a Spark job, use the --jars option, which includes the JARs on the Spark driver and executor classpaths. To include multiple JAR files, separate them with commas.

The following is an example:

spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
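When the command is assembled by a script, the comma-separated value can be built from a list of paths. The following is a minimal sketch, not part of Spark itself: the helper name build_submit_command and the application file app.py are hypothetical and used only for illustration.

```python
import shlex


def build_submit_command(app, jars):
    """Return a spark-submit argument list with the JARs joined by commas."""
    cmd = ["spark-submit"]
    if jars:
        # --jars expects a single comma-separated value with no spaces
        cmd += ["--jars", ",".join(jars)]
    # app is a hypothetical application file used for illustration
    cmd.append(app)
    return cmd


cmd = build_submit_command("app.py", ["/path/to/jar/file1", "/path/to/jar/file2"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting problems when a path contains spaces.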

Use --packages option

The --packages option takes a comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. Spark searches the local Maven repository, then Maven Central, and then any additional remote repositories given by the --repositories option. Each coordinate has the format groupId:artifactId:version.

For example, the following command adds the koalas package as a dependency:

spark-submit --packages com.latentview.koalas:koalas:0.0.1-beta

If the package is not available in the local Maven repository, Spark downloads it from Maven Central, so network access is required. This can be a limitation in some enterprise environments. If you have internal repositories, you can specify them via the --repositories option.
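A malformed coordinate is easy to catch before the job is submitted. The sketch below checks the groupId:artifactId:version shape and assembles the --packages and --repositories arguments. The function names and the repository URL https://repo.example.com/maven are assumptions made for illustration and are not part of Spark.

```python
def parse_coordinate(coord):
    """Split a Maven coordinate into (groupId, artifactId, version)."""
    parts = coord.split(":")
    # Exactly three non-empty parts are expected
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
    return tuple(parts)


def packages_args(coords, repositories=()):
    """Build the --packages (and optional --repositories) arguments."""
    for coord in coords:
        parse_coordinate(coord)  # raises ValueError on a malformed coordinate
    args = ["--packages", ",".join(coords)]
    if repositories:
        # Additional remote repositories are also comma-separated
        args += ["--repositories", ",".join(repositories)]
    return args


# The repository URL below is a hypothetical internal repository
print(packages_args(
    ["com.latentview.koalas:koalas:0.0.1-beta"],
    repositories=["https://repo.example.com/maven"],
))
```

This only validates the format of the coordinate. Whether the artifact actually exists is still decided when Spark resolves it against the repositories.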

Add dynamically when constructing Spark session

Another approach is to add the dependencies dynamically when constructing the Spark session.

The following example adds the SQL Server JDBC driver package to the driver classpath. If you also want to add it to the executor classpath, use the property spark.executor.extraClassPath.

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
# Put the JDBC driver JAR on the driver classpath before the context starts
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath", "sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Obtain the SparkSession that backs the SQLContext
spark = sqlContext.sparkSession

The same result can be achieved from the command line:

spark-submit --driver-class-path sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar ...
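The command-line options covered in this article each have a corresponding configuration property, so the same dependencies can be expressed either way. The sketch below translates options into property names. The helper options_to_conf is hypothetical, and the mapping is limited to the four options discussed here.

```python
# spark-submit options and the configuration properties they correspond to
OPTION_TO_PROPERTY = {
    "--jars": "spark.jars",
    "--packages": "spark.jars.packages",
    "--repositories": "spark.jars.repositories",
    "--driver-class-path": "spark.driver.extraClassPath",
}


def options_to_conf(options):
    """Translate spark-submit style options into configuration properties."""
    conf = {}
    for option, value in options.items():
        if option not in OPTION_TO_PROPERTY:
            raise KeyError(f"no known property for {option}")
        conf[OPTION_TO_PROPERTY[option]] = value
    return conf


conf = options_to_conf({
    "--driver-class-path": "sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar",
    "--packages": "com.latentview.koalas:koalas:0.0.1-beta",
})
for key, value in conf.items():
    print(f"{key}={value}")
```

Assuming PySpark is installed, the resulting pairs could be applied with SparkConf().setAll(list(conf.items())) before the context is created.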

Alternatively, the SparkContext.addJar function (available in the Scala and Java APIs) can be used to add a JAR to the Spark session.
