Read and Write XML files in PySpark

This article shows you how to read and write XML files in Spark.

Sample XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

Dependent library

This example depends on the spark-xml library by Databricks. For more information, refer to the following repo on GitHub:

spark-xml

You can download this package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.

Make sure this package exists in your Spark environment. Alternatively, you can pass the package in as a parameter when running a Spark job with the spark-submit or pyspark command. For example:

spark-submit --jars spark-xml_2.12-0.6.0.jar ...
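
If your environment has internet access, you can also let Spark resolve the package from Maven directly with the --packages option instead of shipping the jar yourself. The coordinate below is the Scala 2.12 build of version 0.6.0:

spark-submit --packages com.databricks:spark-xml_2.12:0.6.0 ...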

Error debug

You may encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
        at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
        at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)

This error occurs because the Scala version of your Spark installation does not match the Scala version that the spark-xml package was built against.

For example, spark-xml_2.12-0.6.0.jar is built for Scala 2.12 (it depends on Scala 2.12.8). To fix the error, switch to a spark-xml build that matches your Spark distribution's Scala version:

spark-submit --jars spark-xml_2.11-0.4.1.jar ...
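
If you are not sure which Scala version your Spark build uses, you can check it from a PySpark session. The snippet below is a quick sketch that goes through py4j's internal JVM gateway (_jvm), which works on standard Spark distributions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Scala version compiled into the running Spark build, e.g. "version 2.12.8"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())

Pick the spark-xml artifact whose suffix (_2.11 or _2.12) matches this version. Alternatively, spark-submit --version also prints the Scala version of your distribution.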

Read XML file

Remember to change your file location accordingly. 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Explicit schema: the id attribute maps to column _id (attributes get a _ prefix)
schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

# Read the XML file, treating each <record> element as one row
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml", schema=schema)

df.show()

Output

Each XML attribute is converted to a column named _${AttributeName} (i.e. with a _ prefix), while each child element is converted to a column named after the element.

+---+---+--------+
|_id|rid|    name|
+---+---+--------+
|  1|  1|Record 1|
|  2|  2|Record 2|
|  3|  3|Record 3|
+---+---+--------+
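
The _ prefix is not hard-coded: spark-xml documents an attributePrefix option (default _) that controls how attribute columns are named. As a minimal sketch, assuming your spark-xml version supports the option, the following rereads the sample file so the id attribute surfaces as attr_id instead of _id:

# Name attribute columns with a custom prefix instead of the default "_"
df2 = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .option("attributePrefix", "attr_") \
    .load("file:///home/tangr/python-examples/test.xml")

df2.show()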

Write XML file

df.select("rid","name").write.format("com.databricks.spark.xml").option("rootTag", "data").option("rowTag", "record").mode(
    "overwrite").save('file:///home/tangr/python-examples/test2.xml')

The output is saved as a folder of partition files; how many files are produced depends on the parallelism of your Spark session.
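
If the data set is small and you want a single output file instead, you can coalesce the data frame to one partition before writing. Note that this funnels all data through a single task, so only do it for small outputs:

# Write everything into a single partition file
df.select("rid", "name").coalesce(1) \
    .write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save('file:///home/tangr/python-examples/test2.xml')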

Output

<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

References

If you want to read a single local file using Python, refer to the following article:

Read and Write XML Files with Python
