Add JARs to a Spark Job
Spark applications can reference Java libraries. Once the application is built, the spark-submit command is used to submit it to run in a Spark environment.
Use --jars option
To add JARs to a Spark job, use the --jars option, which includes the JARs on the Spark driver and executor classpaths. To include multiple JAR files, separate them with commas.
The following is an example:
spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
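When the JAR list is assembled by a script rather than typed by hand, the comma-joined value can be built programmatically. The following is a minimal sketch, not part of Spark itself: the helper names build_jars_args and submit are hypothetical, the paths are the placeholders used above, and it assumes spark-submit is on the PATH.

```python
import subprocess


def build_jars_args(jar_paths):
    """Return the spark-submit arguments that add the given JARs.

    Hypothetical helper: --jars expects a single comma-separated value,
    so the paths are joined with commas and no spaces.
    """
    jars = [p.strip() for p in jar_paths if p.strip()]
    if not jars:
        return []
    return ["--jars", ",".join(jars)]


def submit(app, jar_paths):
    """Run spark-submit for `app`; assumes spark-submit is on the PATH."""
    cmd = ["spark-submit", *build_jars_args(jar_paths), app]
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the arguments; call submit(...) to actually launch a job.
    print(build_jars_args(["/path/to/jar/file1", "/path/to/jar/file2"]))
```

Passing the arguments as a list instead of one shell string avoids quoting problems when a path contains spaces.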
Use --packages option
The --packages option takes a comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. Spark searches the local Maven repository first, then Maven Central, and then any additional remote repositories given by the --repositories option. Each coordinate must have the format groupId:artifactId:version.
For example, the following command adds the koalas package as a dependency:
spark-submit --packages com.latentview.koalas:koalas:0.0.1-beta
If the package is not available in the local Maven repository, Spark downloads it from Maven Central, so network access is required. This can be a limitation in some enterprise environments. If you have internal repositories, you can specify them with the --repositories option.
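A malformed coordinate is only reported once spark-submit tries to resolve it, so a launcher script may want to check the groupId:artifactId:version shape first. The sketch below is one way to do that; build_package_args is a hypothetical helper, the format check is deliberately loose, and the repository URL in the usage lines is a made-up placeholder.

```python
import re

# Three non-empty parts separated by ':'; no commas or whitespace allowed.
_COORD = re.compile(r"[^:,\s]+:[^:,\s]+:[^:,\s]+")


def build_package_args(coordinates, repositories=()):
    """Return spark-submit arguments for --packages and --repositories.

    Hypothetical helper: it only checks the three-part
    groupId:artifactId:version shape, not whether the package exists.
    """
    coords = list(coordinates)
    repos = list(repositories)
    for coord in coords:
        if not _COORD.fullmatch(coord):
            raise ValueError(
                f"not a groupId:artifactId:version coordinate: {coord!r}"
            )
    args = []
    if coords:
        args += ["--packages", ",".join(coords)]
    if repos:
        # Both options take comma-separated lists.
        args += ["--repositories", ",".join(repos)]
    return args


if __name__ == "__main__":
    print(build_package_args(
        ["com.latentview.koalas:koalas:0.0.1-beta"],
        ["https://repo.example.com/maven"],  # placeholder internal repository
    ))
```

The returned list can be spliced into a spark-submit command line in the same way as any other option.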
Add dynamically when constructing Spark session
Another approach is to add the dependencies dynamically when constructing the Spark session.
The following example adds the SQL Server JDBC driver package to the driver classpath. To also add it to the executor classpath, set the spark.executor.extraClassPath property.
from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath", "sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
The driver classpath can also be set on the command line instead:
spark-submit --driver-class-path sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar ...
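The session-construction example sets only the driver classpath, while the text also mentions spark.executor.extraClassPath for executors. The following is a minimal sketch of setting both properties together. The helper names classpath_conf and build_session are hypothetical, and the ':' separator is an assumption about the cluster's platform.

```python
def classpath_conf(jar_paths, separator=":"):
    """Return Spark properties that put the JARs on both classpaths.

    Assumption: ':' is the classpath separator on the cluster's platform
    (Windows uses ';'). The JARs must already exist at these paths on
    every node; extraClassPath only extends the classpath and does not
    copy files.
    """
    classpath = separator.join(jar_paths)
    return {
        "spark.driver.extraClassPath": classpath,
        "spark.executor.extraClassPath": classpath,
    }


def build_session(app_name, props):
    """Create a SparkSession with the given properties applied.

    pyspark is imported lazily so that classpath_conf can be used
    without a Spark installation.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in props.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    props = classpath_conf(["sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar"])
    print(props)
    # spark = build_session("PySpark SQL Server Example - via JDBC", props)
```

Note that spark.driver.extraClassPath only takes effect if it is set before the driver JVM starts, so when the application is launched through spark-submit in client mode, the --driver-class-path option is the safer choice.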
Alternatively, the SparkContext.addJar function, available in the Scala and Java APIs, can be used to add a JAR to a running Spark session.