About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. Many companies still prefer Scala, both for better performance and for access to the full set of features that Spark offers. This article therefore provides examples of how to load an XML file as a Spark DataFrame using Scala.
Sample XML file
The sample input XML file is from my previous article.
test.xml
<?xml version="1.0"?>
<data>
  <record id="1">
    <rid>1</rid>
    <name>Record 1</name>
  </record>
  <record id="2">
    <rid>2</rid>
    <name>Record 2</name>
  </record>
  <record id="3">
    <rid>3</rid>
    <name>Record 3</name>
  </record>
</data>
The file is stored locally on my computer; you can also ingest it into HDFS if necessary.
Spark-XML library
Please follow my previous Python article to download the spark-xml Java library.
The version I am going to use is spark-xml_2.12-0.11.0.jar (for Scala 2.12).
Credits to Databricks and the other open source contributors to this package, which provides the XML data source for Spark.
Code snippet
The following code snippet is intended for Spark-Shell. You can also create a Scala file and run it with the spark-submit command, similar to the PySpark example. The JAR file can be added in the submit command or specified when initiating the SparkSession.
First, let's add this JAR when starting Spark-Shell:
scala> :require spark-xml_2.12-0.11.0.jar
Added 'F:\big-data\spark-xml_2.12-0.11.0.jar' to classpath.
*Note: my Spark-Shell was started from the F:\big-data folder.
Then we can use the SparkSession.read API directly to read the XML file.
scala> spark.read.format("com.databricks.spark.xml").option("rowTag","record").load("file:///F:\\big-data\\test.xml").show()
+---+--------+
|_id|   _name|
+---+--------+
|  1|Record 1|
|  2|Record 2|
|  3|Record 3|
+---+--------+
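For anything beyond a quick look, it can help to keep the DataFrame in a variable and inspect what spark-xml inferred. The following is a minimal sketch, not part of my original session: it writes a compact copy of the sample to a temporary file so that it does not depend on my folder layout, and it sets `attributePrefix` to its default value `_` only to make explicit where the leading underscore in attribute columns such as `_id` comes from. The names `sampleXml`, `xmlFile` and `recordCount` are chosen for illustration. In Spark-Shell, use `:paste` mode so the multi-line expressions are read as one block.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In Spark-Shell this returns the existing session; elsewhere it starts a local one.
// The spark-xml JAR must already be on the classpath.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("xml-schema-check")
  .getOrCreate()

// Compact copy of the sample file, written to a temporary location.
val sampleXml =
  """<?xml version="1.0"?>
    |<data>
    |<record id="1"><rid>1</rid><name>Record 1</name></record>
    |<record id="2"><rid>2</rid><name>Record 2</name></record>
    |<record id="3"><rid>3</rid><name>Record 3</name></record>
    |</data>""".stripMargin
val xmlFile = Files.createTempFile("test", ".xml")
Files.write(xmlFile, sampleXml.getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")      // each <record> element becomes one row
  .option("attributePrefix", "_")  // spark-xml default prefix for XML attributes
  .load(xmlFile.toUri.toString)

// Show the inferred column names and types.
df.printSchema()

// Count the rows; the sample contains three <record> elements.
val recordCount = df.count()
println(s"Loaded $recordCount records")
```

Checking the schema first is a cheap way to confirm that `rowTag` points at the right element before building further transformations on top of the DataFrame.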
You can change the path to HDFS, Azure Blob Storage, AWS S3, GCS or other storage systems as needed.
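As a rough illustration of switching storage, the sketch below lists one example URI per storage system. All host, container, account and bucket names are placeholders, `readXml` is a hypothetical helper, and each scheme only works if the matching Hadoop connector and credentials are configured in your Spark environment.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder locations; replace the host, container, account and bucket names with your own.
val examplePaths: Map[String, String] = Map(
  "HDFS"               -> "hdfs://namenode:8020/data/test.xml",
  "Azure Blob Storage" -> "wasbs://container@account.blob.core.windows.net/data/test.xml",
  "AWS S3"             -> "s3a://my-bucket/data/test.xml",
  "GCS"                -> "gs://my-bucket/data/test.xml"
)

// Hypothetical helper: the same read as in the Spark-Shell example, only the path changes.
def readXml(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "record")
    .load(path)
```

In Spark-Shell, a call such as `readXml(spark, examplePaths("HDFS")).show()` would then replace the local `file:///` read.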
The code is almost identical to the PySpark version because the API names are consistent.
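The spark-submit route mentioned earlier could look roughly like the standalone application below. This is a sketch under assumptions rather than code I ran for this article: the object name `ReadXmlApp`, the helper `loadRecords` and the optional path argument are made up for illustration, and the spark-xml JAR is expected to be supplied on the submit command.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical application; after packaging it, submit with something like:
//   spark-submit --class ReadXmlApp --jars spark-xml_2.12-0.11.0.jar <your-application>.jar
object ReadXmlApp {

  // Load every <record> element of the XML file at `path` as one row.
  def loadRecords(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load(path)

  def main(args: Array[String]): Unit = {
    // Use the first argument as the path, or fall back to the local file from this article.
    val path = args.headOption.getOrElse("file:///F:\\big-data\\test.xml")

    // The master is left unset here so that spark-submit can provide it.
    val spark = SparkSession.builder()
      .appName("ReadXmlApp")
      .getOrCreate()

    try {
      val df = loadRecords(spark, path)
      df.printSchema()
      df.show()
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the read logic in `loadRecords` means the same function can be called from Spark-Shell or from a test with a different session and path.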