Add JARs to a Spark Job

Raymond, 2021-02-20

Spark applications can reference Java libraries packaged as JAR files. Once the application is built, the spark-submit command is called to submit it to run in a Spark environment, and the JARs it depends on must be made available there.

Use --jars option

To add JARs to a Spark job, use the --jars option, which includes the listed JARs on the Spark driver and executor classpaths. To include multiple JAR files, separate them with commas.

The following is an example:

spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
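When the list of JARs is assembled by a script, the comma-separated value can be built programmatically. The following is a minimal sketch, not part of Spark itself: build_submit_command is a hypothetical helper and my_app.py is a placeholder application name.

```python
import shlex


def build_submit_command(app, jars):
    """Return a spark-submit argument list that passes `jars` via --jars.

    Hypothetical helper for illustration; it only builds the command.
    """
    cmd = ["spark-submit"]
    if jars:
        # --jars takes one comma-separated value, without spaces.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(app)
    return cmd


cmd = build_submit_command(
    "my_app.py",  # placeholder application script (assumption)
    ["/path/to/jar/file1", "/path/to/jar/file2"],
)
# Render the argument list as a single shell command line.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) if you want to launch the job from Python rather than from a shell.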

Use --packages option

The --packages option takes a comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. Spark searches the local Maven repository first, then Maven Central, and then any additional remote repositories given by the --repositories option. Each coordinate has the format groupId:artifactId:version.

For example, the following command adds the koalas package as a dependency:

spark-submit --packages com.latentview.koalas:koalas:0.0.1-beta

If the package is not available in the local Maven repository, Spark downloads it from Maven Central, so network access is required. This can be a limitation in some enterprise environments. If you have internal repositories, you can specify them with the --repositories option.
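A malformed coordinate is only reported once spark-submit tries to resolve it, so it can help to check the groupId:artifactId:version shape beforehand. The sketch below is an illustration under stated assumptions: parse_coordinate and packages_args are hypothetical helpers, the check is deliberately simple (three non-empty parts), and the repository URL is a placeholder.

```python
def parse_coordinate(coord):
    """Split a Maven coordinate into (groupId, artifactId, version).

    Hypothetical validator: it only checks for three non-empty,
    colon-separated parts.
    """
    parts = coord.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected groupId:artifactId:version, got {coord!r}"
        )
    return tuple(parts)


def packages_args(coords, repositories=()):
    """Build the --packages (and optional --repositories) arguments."""
    for coord in coords:
        parse_coordinate(coord)  # raises ValueError on a malformed entry
    args = ["--packages", ",".join(coords)]
    if repositories:
        # --repositories is also a comma-separated list.
        args += ["--repositories", ",".join(repositories)]
    return args


args = packages_args(
    ["com.latentview.koalas:koalas:0.0.1-beta"],
    repositories=["https://repo.example.com/maven"],  # placeholder URL
)
print(args)
```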

Add dynamically when constructing Spark session

Another approach is to add the dependencies dynamically when constructing the Spark session.

The following example adds the SQL Server JDBC driver JAR to the driver classpath. To also add it to the executor classpath, set the property spark.executor.extraClassPath.

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
# Put the JDBC driver JAR on the driver classpath.
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath", "sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar")
sc = SparkContext(conf=conf)
# Get the SparkSession through the SQLContext that wraps the SparkContext.
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

The same result can be achieved from the command line:

spark-submit --driver-class-path sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar ...
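The --jars, --packages and --repositories options have configuration-property counterparts (spark.jars, spark.jars.packages and spark.jars.repositories), which can be set in the same way when the session is built. The following is a minimal sketch, assuming PySpark is installed and no session is already running: jar_conf and create_session are hypothetical helper names, and these settings generally need to be in place before the session is created.

```python
def jar_conf(jars=(), packages=(), repositories=()):
    """Map JAR dependencies to Spark configuration properties.

    Hypothetical helper; each property takes a comma-separated list.
    """
    conf = {}
    if jars:
        conf["spark.jars"] = ",".join(jars)
    if packages:
        conf["spark.jars.packages"] = ",".join(packages)
    if repositories:
        conf["spark.jars.repositories"] = ",".join(repositories)
    return conf


def create_session(app_name, conf):
    """Create a SparkSession with the given configuration applied."""
    # Imported here so jar_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = jar_conf(jars=["sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar"])
print(conf)
```

Calling create_session("PySpark SQL Server Example - via JDBC", conf) would then start a session with the JAR attached.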

Alternatively, the SparkContext.addJar function (available in the Scala and Java APIs) can be used to add a JAR to a running Spark application.
