Spark Scala: Read XML File as DataFrame

Raymond Raymond, 2020-12-16

About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. Many companies still prefer Scala, both for better performance and to make use of the full set of features that Spark offers. This article therefore provides an example of how to load an XML file as a Spark DataFrame using Scala.

Sample XML file

The sample input XML file is from my previous article. 

test.xml

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The file is stored locally on my computer; you can also upload it to HDFS if necessary.

Spark-XML library

Please follow my previous Python article to download the spark-xml Java library.

The version I am going to use is spark-xml_2.12-0.11.0.jar (for Scala 2.12).

Credits to Databricks and the other open source contributors to this XML data source package for Spark.

Code snippet

The following code snippet is intended for use in Spark-Shell. You can also create a Scala file and run it with the spark-submit command, similar to the PySpark example. The JAR file can be added in the submit command or specified when initiating the SparkSession.

First, let's add this JAR to the classpath from within Spark-Shell:

scala> :require spark-xml_2.12-0.11.0.jar
Added 'F:\big-data\spark-xml_2.12-0.11.0.jar' to classpath.

*Note: my Spark-Shell is started from the F:\big-data folder.

Then we can use the SparkSession.read API directly to read the XML file.

scala> spark.read.format("com.databricks.spark.xml").option("rowTag","record").load("file:///F:\\big-data\\test.xml").show()
+---+--------+
|_id|   _name|
+---+--------+
|  1|Record 1|
|  2|Record 2|
|  3|Record 3|
+---+--------+
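If you would rather not rely on schema inference, a schema can be passed to the reader explicitly. The following is only a minimal sketch for Spark-Shell: it assumes the same file path as above, the default spark-xml mapping where XML attributes get a `_` prefix (the `attributePrefix` option) and child elements keep their tag names, and column types that I picked for illustration. Run `printSchema()` on the inferred DataFrame first and adjust the names if yours differ.

```scala
// In Spark-Shell, enter :paste mode first so the multi-line
// expressions below are read as single statements.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumed layout: attribute id -> _id, child elements rid and name.
val schema = StructType(Seq(
  StructField("_id", IntegerType, nullable = true),
  StructField("rid", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record") // each <record> element becomes one row
  .schema(schema)             // skip inference and use the schema above
  .load("file:///F:\\big-data\\test.xml")

df.printSchema()
df.select("_id", "rid", "name").show()
```

Supplying the schema avoids the extra pass over the file that inference needs, which tends to matter more for large XML inputs.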

You can change the path to HDFS, Azure Blob Storage, AWS S3, GCS or other storage systems accordingly.

The code is almost identical to the PySpark version because the API names are consistent.
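For the spark-submit route mentioned earlier, the same read can be wrapped in a small application. This is a rough sketch rather than a tested project: the object name `ReadXmlApp`, the `RowTag` constant and the convention of passing the input path as the first argument are all my own assumptions, and the spark-xml JAR is expected to be supplied on the submit command.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application; one possible submit command (paths are placeholders):
//   spark-submit --jars spark-xml_2.12-0.11.0.jar --class ReadXmlApp <your-app>.jar <input-path>
object ReadXmlApp {
  // Row tag of the sample file used in this article.
  val RowTag = "record"

  def main(args: Array[String]): Unit = {
    // The input path is taken from the command line so it can point to
    // a local file, HDFS or any other storage your cluster is configured for.
    val inputPath = args.headOption.getOrElse("file:///F:\\big-data\\test.xml")

    val spark = SparkSession.builder()
      .appName("ReadXmlApp")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", RowTag)
      .load(inputPath)

    df.show()
    spark.stop()
  }
}
```

Taking the path as an argument keeps the code unchanged when you switch between a local file and a remote storage location.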
