Read and Write XML files in PySpark


This article shows you how to read and write XML files in Spark.

Sample XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
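If you would rather create the sample file from Python, the following is a minimal sketch that uses only the standard library. It writes the document to a temporary directory (an assumption made for illustration; use your own location in practice) and parses it back to confirm that it is well-formed and contains three records. The helper name write_sample is hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

SAMPLE_XML = """<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
"""


def write_sample(directory):
    """Write the sample document to <directory>/test.xml and return its path."""
    path = Path(directory) / "test.xml"
    path.write_text(SAMPLE_XML, encoding="utf-8")
    return path


# A temporary directory is used here so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = write_sample(tmp)
    # Parse the file back to check that it is well-formed XML.
    root = ET.parse(str(sample_path)).getroot()
    record_ids = [record.get("id") for record in root.findall("record")]

print(root.tag, record_ids)
```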

Dependent library

The examples in this article use the spark-xml package from Databricks. For more information, refer to its GitHub repository:

spark-xml

You can download the package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.

Make sure this package exists in your Spark environment. Alternatively, you can pass the package as a parameter when running a Spark job with the spark-submit or pyspark command. For example:

spark-submit --jars spark-xml_2.12-0.6.0.jar ...
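The jar file name encodes both the Scala binary version (2.12) and the package version (0.6.0). As a rough sketch, the helper below (spark_xml_artifact is a hypothetical name, not part of any library) builds the jar name used above and the matching Maven coordinate. The coordinate can be passed to the --packages option of spark-submit instead of --jars, in which case Spark resolves the package from Maven rather than from a local file.

```python
def spark_xml_artifact(scala_binary="2.12", version="0.6.0"):
    """Return (jar file name, Maven coordinate) for a spark-xml build.

    The defaults mirror the jar used in this article; adjust them to
    match your own Spark environment.
    """
    artifact = f"spark-xml_{scala_binary}"
    jar_name = f"{artifact}-{version}.jar"
    coordinate = f"com.databricks:{artifact}:{version}"
    return jar_name, coordinate


jar_name, coordinate = spark_xml_artifact()

# Option 1: a jar that already exists on the local file system.
print(f"spark-submit --jars {jar_name} ...")
# Option 2: let Spark download the package from Maven.
print(f"spark-submit --packages {coordinate} ...")
```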

Error debug

You may encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
        at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
        at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)

This error occurs because the Scala version your Spark installation was built with does not match the Scala version the spark-xml package was built for.

For example, spark-xml_2.12-0.6.0.jar depends on Scala version 2.12.8. If your Spark installation uses Scala 2.11, switch to a spark-xml package built for Scala 2.11:

spark-submit --jars spark-xml_2.11-0.4.1.jar ...
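To check for this mismatch before submitting a job, you can compare the Scala version in the jar file name with the Scala version of your Spark installation (spark-submit --version prints it). The sketch below is one possible way to do that; the function names and the Scala version "2.11.12" are assumptions made for illustration.

```python
import re

# The Scala binary version sits between "spark-xml_" and the next "-".
JAR_PATTERN = re.compile(r"spark-xml_(\d+\.\d+)-")


def scala_binary_of_jar(jar_name):
    """Extract the Scala binary version (for example "2.12") from a jar name."""
    match = JAR_PATTERN.search(jar_name)
    if match is None:
        raise ValueError(f"Not a spark-xml jar name: {jar_name}")
    return match.group(1)


def is_compatible(jar_name, spark_scala_version):
    """Compare only major.minor, since that defines Scala binary compatibility."""
    spark_binary = ".".join(spark_scala_version.split(".")[:2])
    return scala_binary_of_jar(jar_name) == spark_binary


# Hypothetical Spark installation built with Scala 2.11.12.
spark_scala_version = "2.11.12"
for jar in ("spark-xml_2.12-0.6.0.jar", "spark-xml_2.11-0.4.1.jar"):
    print(jar, is_compatible(jar, spark_scala_version))
```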

Read XML file

Remember to change your file location accordingly. 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .schema(schema) \
    .load("file:///home/tangr/python-examples/test.xml")

df.show()

Output

Each attribute of the row element is converted to a column named _${AttributeName} (the attribute name prefixed with _), while each child element is converted to a column named after the element.

+---+---+--------+
|_id|rid|    name|
+---+---+--------+
|  1|  1|Record 1|
|  2|  2|Record 2|
|  3|  3|Record 3|
+---+---+--------+
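The naming rule can be illustrated without Spark. The following sketch uses Python's standard xml.etree.ElementTree module to map one record element to a dictionary in the same way: attributes get the _ prefix and child elements keep their tag names. This only illustrates the convention and is not how spark-xml is implemented; record_to_row is a hypothetical helper, and all values stay strings because no schema is applied.

```python
import xml.etree.ElementTree as ET

# One row element taken from the sample file.
RECORD = '<record id="1"><rid>1</rid><name>Record 1</name></record>'


def record_to_row(element, attribute_prefix="_"):
    """Map a row element to a dict using the attribute-prefix convention."""
    # Attributes become keys with the prefix, for example id -> _id.
    row = {attribute_prefix + key: value for key, value in element.attrib.items()}
    # Child elements become keys named after their tags.
    for child in element:
        row[child.tag] = child.text
    return row


row = record_to_row(ET.fromstring(RECORD))
print(row)
```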

Write XML file

df.select("rid", "name").write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save("file:///home/tangr/python-examples/test2.xml")

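The save path is created as a directory rather than a single file. Spark writes one part file per DataFrame partition into it, so the number of files depends on the parallelism of your Spark session.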

Output

<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
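Because the save path is a directory of part files, you may want to list what was written. The sketch below simulates such a directory in a temporary location and lists its part files. The file names part-00000, part-00001 and _SUCCESS follow the usual Spark output naming and are assumptions here, as is the helper name list_part_files; check your own output directory for the actual names.

```python
import tempfile
from pathlib import Path


def list_part_files(output_dir):
    """Return the sorted names of the part files in a Spark output directory."""
    return sorted(path.name for path in Path(output_dir).glob("part-*"))


# Simulate an output directory; a real one is produced by df.write...save().
with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "test2.xml"
    output_dir.mkdir()
    for name in ("part-00000", "part-00001", "_SUCCESS"):
        (output_dir / name).touch()
    part_files = list_part_files(output_dir)

print(part_files)
```

If you need a single part file, one option is to call df.coalesce(1) before .write, at the cost of writing all rows from a single task.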

References

If you want to read a single local file using plain Python instead of Spark, refer to the following article:

Read and Write XML Files with Python

Last modified by Raymond 21 days ago.