
This article shows you how to read and write XML files in Spark.

Sample XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

Dependent library

This example uses the spark-xml library. For more information, refer to the following GitHub repository:

spark-xml

You can download this package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.

Make sure this package exists in your Spark environment. Alternatively, you can pass in this package as a parameter when running a Spark job with the spark-submit or pyspark command. For example:

spark-submit --jars spark-xml_2.12-0.6.0.jar ...
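Alternatively, you can let Spark resolve the package directly from Maven with the --packages option instead of supplying a local jar. A minimal sketch, assuming a Spark build on Scala 2.12 (adjust the artifact's Scala suffix and version to match your environment; your_script.py is a placeholder name):

# Let Spark download spark-xml and its dependencies from Maven at submit time.
spark-submit --packages com.databricks:spark-xml_2.12:0.6.0 your_script.py

# The same option works for an interactive PySpark session.
pyspark --packages com.databricks:spark-xml_2.12:0.6.0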

Error debug

You may encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
        at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
        at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)

This error occurs because the Scala version of your Spark installation does not match the Scala version that the spark-xml package was built against.

For example, spark-xml_2.12-0.6.0.jar depends on Scala 2.12.8. If your Spark build uses a different Scala version, switch to a matching version of the spark-xml package:

spark-submit --jars spark-xml_2.11-0.4.1.jar ...
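If you are unsure which Scala version your Spark build uses, running spark-submit --version prints it in the version banner. You can also query it from an active PySpark session; a quick sketch, assuming the SparkSession is named spark and using the py4j JVM gateway (an internal but commonly used handle):

# A sketch: read the Scala version from the JVM that backs the PySpark session.
scala_version = spark.sparkContext._jvm.scala.util.Properties.versionString()
print(scala_version)  # e.g. "version 2.12.10"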

Read XML file

Remember to change your file location accordingly. 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Define the schema: the id attribute maps to column _id (attributes get the _ prefix)
schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

# Read the XML file, treating each <record> element as one row
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml", schema=schema)

df.show()

Output

XML attributes are converted to columns named _${AttributeName} (with prefix _), while child elements are converted to columns named after the element.

+---+---+--------+
|_id|rid|    name|
+---+---+--------+
|  1|  1|Record 1|
|  2|  2|Record 2|
|  3|  3|Record 3|
+---+---+--------+
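If you prefer a different prefix for attribute columns, spark-xml documents an attributePrefix option (the default is "_"). A small sketch, reusing the session above and letting spark-xml infer the schema:

# A sketch: change the prefix used for XML attribute columns from "_" to "attr_".
df2 = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .option("attributePrefix", "attr_") \
    .load("file:///home/tangr/python-examples/test.xml")

df2.printSchema()  # the id attribute now appears as column attr_id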

Write XML file

df.select("rid","name").write.format("com.databricks.spark.xml").option("rootTag", "data").option("rowTag", "record").mode(
    "overwrite").save('file:///home/tangr/python-examples/test2.xml')

The output is saved as partitioned files (one part file per partition), based on the parallelism of your Spark session.
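If you need a single output file instead of multiple part files, you can coalesce the DataFrame to one partition before writing. A sketch, writing to a separate hypothetical path test2_single.xml:

# A sketch: force a single partition so only one part file is written.
# Only suitable for small DataFrames, since all data moves to one task.
df.select("rid", "name").coalesce(1).write \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save("file:///home/tangr/python-examples/test2_single.xml")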

Output

<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
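To verify the round trip, you can read the written output back with the same reader; a short sketch, reusing the session from the read example:

# A sketch: read the part files written above back into a DataFrame.
df_check = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test2.xml")

df_check.show()

If the write succeeded, the same three records (rid and name) should be printed.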