Spark Scala: Read XML File as DataFrame


About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. Many companies still prefer Scala for better performance and to utilize the full set of features that Spark offers. Thus, this article provides examples of how to load an XML file as a Spark DataFrame using Scala as the programming language.

Sample XML file

The sample input XML file is from my previous article. 

test.xml

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The file is stored locally on my computer; you can also ingest it into HDFS as necessary. 

Spark-XML library

Please follow my previous Python article to download the spark-xml Java library. 

The version I am going to use is spark-xml_2.12-0.11.0.jar (for Scala 2.12). 

Credits to Databricks and other open source contributors to this XML package for Spark XML data source processing.

Code snippet

The following code snippet is intended to be used in Spark-Shell. You can also create a Scala file and then use the spark-submit command to run the script, similar to the PySpark example; the JAR file can be added in the submit command or specified when initiating the SparkSession (see the sketch at the end of this article).

First, let's add this JAR when starting Spark-Shell:

scala> :require spark-xml_2.12-0.11.0.jar
Added 'F:\big-data\spark-xml_2.12-0.11.0.jar' to classpath.

*Note - my Spark-Shell is started from the F:\big-data folder. 

Then we can directly use the SparkSession.read API to read the XML file.

scala> spark.read.format("com.databricks.spark.xml").option("rowTag","record").load("file:///F:\\big-data\\test.xml").show()
+---+--------+
|_id|   _name|
+---+--------+
|  1|Record 1|
|  2|Record 2|
|  3|Record 3|
+---+--------+
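
For longer scripts, the same read is easier to maintain when the reader options are split across lines and an explicit schema is supplied so Spark does not need to infer it. The following is a minimal sketch; the schema is my assumption based on the sample test.xml shown above, and the path is illustrative.

import org.apache.spark.sql.types._

// Explicit schema (an assumption based on the sample file above):
// with the default attributePrefix, the "id" attribute becomes column "_id".
val schema = StructType(Seq(
  StructField("_id", LongType, nullable = true),
  StructField("rid", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .schema(schema)
  .load("file:///F:/big-data/test.xml")

df.show()

If you enter this multi-line statement in Spark-Shell, use :paste mode so the chained calls are parsed as one expression.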

You can change the path to HDFS, Azure Blob Storage, AWS S3, GCS, or other storage accordingly. 

The code is almost identical to the PySpark version, as the API names are consistent.
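
As mentioned above, the same logic can also be packaged as a standalone Scala application and run with spark-submit. The following is a minimal sketch under that assumption; the object name, application name and paths are illustrative, and the spark-xml JAR can alternatively be passed on the command line instead of through spark.jars.

import org.apache.spark.sql.SparkSession

object ReadXmlApp {
  def main(args: Array[String]): Unit = {
    // Register the spark-xml JAR when initiating the SparkSession
    // (or pass it via --jars on the spark-submit command line instead).
    val spark = SparkSession.builder()
      .appName("Read XML as DataFrame")
      .config("spark.jars", "file:///F:/big-data/spark-xml_2.12-0.11.0.jar")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("file:///F:/big-data/test.xml")

    df.show()
    spark.stop()
  }
}

After packaging the application (for example with sbt package), it can be submitted with a command along the lines of spark-submit --class ReadXmlApp --jars spark-xml_2.12-0.11.0.jar your-app.jar; the exact command depends on your build setup.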
