
Spark Scala: Read XML File as DataFrame


About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. Many companies still prefer Scala for better performance and for access to the full set of features that Spark offers. This article therefore shows how to load an XML file as a Spark DataFrame using Scala.

Sample XML file

The sample input XML file is from my previous article. 

test.xml

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The file is stored locally on my computer; you can also ingest it into HDFS if necessary.

Spark-XML library

Please follow my previous Python article to download the spark-xml JAR library.

The version I am going to use is spark-xml_2.12-0.11.0.jar (for Scala 2.12).

Credits to Databricks and other open source contributors to this XML package for Spark XML data source processing.

Code snippet

The following code snippet is intended for use in Spark-Shell. You can also create a Scala file and run it with the spark-submit command, similar to the PySpark example. The JAR file can be added in the submit command or specified when initiating the SparkSession.
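For the spark-submit route, a standalone application might look roughly like the sketch below. The object name ReadXmlApp, the application JAR name in the comment and the default input path are assumptions for illustration only; the data source name and the rowTag option are the same ones used in the Spark-Shell example in this article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical standalone application. One way to submit it, assuming it was
// packaged as read-xml-app.jar (a made-up name):
//   spark-submit --class ReadXmlApp --jars spark-xml_2.12-0.11.0.jar read-xml-app.jar
object ReadXmlApp {

  // Loads each <record> element of the XML file at `path` as one DataFrame row.
  def readRecords(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadXmlApp")
      .getOrCreate()

    // Assumption: the input path is the first argument, falling back to the
    // local sample file used in this article.
    val path = args.headOption.getOrElse("file:///F:\\big-data\\test.xml")

    readRecords(spark, path).show()
    spark.stop()
  }
}
```

Passing the spark-xml JAR with --jars keeps the application code free of hard-coded JAR locations.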

First, let's add this JAR when starting Spark-Shell:

scala> :require spark-xml_2.12-0.11.0.jar
Added 'F:\big-data\spark-xml_2.12-0.11.0.jar' to classpath.

*Note: my Spark-Shell was started from the F:\big-data folder.
Then we can directly use the SparkSession.read API to read from the XML file.

scala> spark.read.format("com.databricks.spark.xml").option("rowTag","record").load("file:///F:\\big-data\\test.xml").show()
+---+--------+---+
|_id|    name|rid|
+---+--------+---+
|  1|Record 1|  1|
|  2|Record 2|  2|
|  3|Record 3|  3|
+---+--------+---+
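If you prefer not to rely on schema inference, a schema can be supplied explicitly. The following is a minimal sketch, assuming the sample test.xml above; the object name ReadXmlWithSchema and the chosen column types are my own assumptions. By default spark-xml prefixes attribute names with an underscore, so the id attribute of each record element is exposed as the column _id, while the child elements rid and name keep their names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical helper object for illustration.
object ReadXmlWithSchema {

  // Schema for one <record> row. "_id" maps to the id attribute (default
  // attribute prefix "_"); "rid" and "name" map to the child elements.
  val recordSchema: StructType = StructType(Seq(
    StructField("_id", LongType, nullable = true),
    StructField("rid", LongType, nullable = true),
    StructField("name", StringType, nullable = true)
  ))

  // Reads the XML file using the explicit schema instead of inferring it.
  def readRecords(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .schema(recordSchema)
      .load(path)
}
```

In Spark-Shell, the object can be pasted with :paste and then called as ReadXmlWithSchema.readRecords(spark, "file:///F:\\big-data\\test.xml").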

You can change the path to HDFS, Azure Blob Storage, AWS S3, GCS or other storage accordingly.
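As an illustration, only the path string changes between storage systems. In the sketch below every host, account, container and bucket name is hypothetical, and the object name XmlPaths is made up for this example. The remote schemes also assume that the matching Hadoop file system connector is available on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object for illustration.
object XmlPaths {

  // Example locations only: replace hosts, accounts, containers and buckets
  // with your own values.
  val examplePaths: Map[String, String] = Map(
    "local" -> "file:///F:\\big-data\\test.xml",
    "hdfs"  -> "hdfs://namenode:8020/data/test.xml",
    "azure" -> "wasbs://container@account.blob.core.windows.net/data/test.xml",
    "s3"    -> "s3a://my-bucket/data/test.xml",
    "gcs"   -> "gs://my-bucket/data/test.xml"
  )

  // Reads the <record> rows from the location registered under `storage`.
  def readRecords(spark: SparkSession, storage: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load(examplePaths(storage))
}
```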

The code is almost identical to the PySpark version because the API names are consistent.

Last modified by Raymond 8 months ago.