About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. Many companies still prefer Scala, both for its better performance and for access to the full set of features that Spark offers. This article therefore provides examples of how to load an XML file as a Spark DataFrame using Scala.
Sample XML file
The sample input XML file is from my previous article.
<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
The file is stored locally on my computer; you can also ingest it into HDFS if necessary.
Please follow my previous Python article to download the spark-xml Java library.
The version used in the session below is spark-xml_2.12-0.11.0.jar (for Scala 2.12).
First, let's add this JAR to the classpath from within Spark-Shell:
scala> :require spark-xml_2.12-0.11.0.jar
Added 'F:\big-data\spark-xml_2.12-0.11.0.jar' to classpath.
*Note: my Spark-Shell is started from the F:\big-data folder.
Then we can use the SparkSession.read API directly to read the XML file.
scala> spark.read.format("com.databricks.spark.xml").option("rowTag","record").load("file:///F:\\big-data\\test.xml").show()
+---+--------+
|_id|   _name|
+---+--------+
|  1|Record 1|
|  2|Record 2|
|  3|Record 3|
+---+--------+
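The same read can also be wrapped in a small standalone Scala application instead of being typed into Spark-Shell. The following is a minimal sketch, not a definitive implementation: the object name `ReadXmlExample`, the helper `readRecords`, and the `local[*]` master are assumptions made for illustration, and the spark-xml JAR still has to be on the application classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used for illustration only.
object ReadXmlExample {

  // Reads every <record> element of the XML file at `path` as one DataFrame row.
  def readRecords(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml") // data source provided by the spark-xml JAR
      .option("rowTag", "record")         // XML element that maps to a row
      .load(path)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; change the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("ReadXmlExample")
      .master("local[*]")
      .getOrCreate()

    // Defaults to the path used in this article; pass another path as the first argument.
    val path = args.headOption.getOrElse("file:///F:\\big-data\\test.xml")

    val df = readRecords(spark, path)
    df.printSchema() // inspect the inferred column names and types
    df.show()

    spark.stop()
  }
}
```

Calling `printSchema()` before `show()` is a convenient way to check which columns spark-xml inferred from the attributes and child elements of each record.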
You can change the path to HDFS, Azure Blob Storage, AWS S3, GCS or another storage system accordingly.
The code is almost identical to the PySpark version because the API names are consistent.
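To illustrate what changing the path looks like, here is a small sketch of typical URI forms for each storage system. Every host, bucket, container and account name is a hypothetical placeholder, and each remote storage also needs its matching Hadoop connector and credentials configured in Spark, which is not shown here.

```scala
// Hypothetical object name; all remote locations below are placeholders.
object XmlSourcePaths {

  val examples: Map[String, String] = Map(
    "local" -> "file:///F:\\big-data\\test.xml",                  // path used in this article
    "hdfs"  -> "hdfs://namenode:8020/data/test.xml",              // assumed NameNode host and port
    "s3"    -> "s3a://my-bucket/data/test.xml",                   // assumed bucket name
    "azure" -> "wasbs://my-container@myaccount.blob.core.windows.net/data/test.xml",
    "gcs"   -> "gs://my-bucket/data/test.xml"                     // assumed bucket name
  )

  def main(args: Array[String]): Unit =
    examples.foreach { case (storage, path) => println(s"$storage -> $path") }
}
```

Only the string passed to `load` changes; for example, `load(XmlSourcePaths.examples("hdfs"))` would replace the local path while the format and `rowTag` option stay the same.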