This article shows you how to read and write XML files in Spark using the spark-xml package.
Sample XML file
Create a sample XML file named test.xml with the following content:
<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
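If you prefer to create the file from a script, the following minimal sketch writes the same content to test.xml in the current working directory (adjust the path to your environment):

# Write the sample XML content to a local file
sample = """<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
"""
with open("test.xml", "w") as f:
    f.write(sample)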
Dependent library
For more information, refer to the spark-xml repository on GitHub: https://github.com/databricks/spark-xml.
You can download this package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.
Make sure the package is available in your Spark environment. Alternatively, you can pass the package jar as a parameter when running a Spark job with the spark-submit or pyspark command. For example:
spark-submit --jars spark-xml_2.12-0.6.0.jar ...
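If you don't have the jar locally, spark-submit can also resolve the package from Maven by its coordinates using the --packages option, for example:

spark-submit --packages com.databricks:spark-xml_2.12:0.6.0 ...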
Debugging errors
You may encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)
This error occurs because the Scala version of your Spark installation does not match the Scala version that the spark-xml package was built against. For example, spark-xml_2.12-0.6.0.jar is built for Scala 2.12. If your Spark distribution uses a different Scala version, switch to a matching build of the spark-xml package:
spark-submit --jars spark-xml_2.11-0.4.1.jar ...
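To find out which Scala version your Spark distribution was built with, check the banner printed by the following command, which includes a line such as "Using Scala version 2.12.x":

spark-submit --version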
Read XML file
Remember to change your file location accordingly.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# The XML attribute 'id' maps to column '_id' (attributes get the '_' prefix)
schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml", schema=schema)
df.show()
Output
Each XML attribute is converted to a column named _${AttributeName} (attributes get the prefix _ by default), while each child element is converted to a column named after the element.
+---+---+--------+
|_id|rid| name|
+---+---+--------+
| 1| 1|Record 1|
| 2| 2|Record 2|
| 3| 3|Record 3|
+---+---+--------+
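The explicit schema is optional; spark-xml can also infer one by sampling the data. A minimal sketch, reusing the Spark session and file path from above:

# Let spark-xml infer the schema instead of supplying one explicitly
df_inferred = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml")
df_inferred.printSchema()

Inspect the printed schema to confirm the inferred column names and types match what your downstream code expects; an explicit schema avoids the extra pass over the data that inference requires.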
Write XML file
df.select("rid", "name").write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save("file:///home/tangr/python-examples/test2.xml")
The output is written as a directory named test2.xml containing one part file per partition, so the number of files depends on the parallelism of your Spark session.
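If you want all the output in a single part file, you can reduce the DataFrame to one partition before writing, for example with coalesce (note that this funnels all data through a single task, so only do it for small datasets):

# Write everything into a single partition, producing one part file
df.select("rid", "name").coalesce(1) \
    .write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save("file:///home/tangr/python-examples/test2.xml")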
Output
<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
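To verify the round trip, you can read the written output back with the same reader options; load accepts the output directory path and reads all part files in it:

# Read the partition files back from the output directory
df_check = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test2.xml")
df_check.show()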
References
If you want to read a single local file using Python, refer to the following article: