Read and Write XML files in PySpark
This article shows you how to read and write XML files in PySpark using the spark-xml library.
Sample XML file
Create a sample XML file named test.xml with the following content:
<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
Dependent library
For more information, refer to the databricks/spark-xml repository on GitHub: https://github.com/databricks/spark-xml.
You can download this package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.
Make sure this package exists in your Spark environment. Alternatively, you can pass the package in as a parameter when submitting a Spark job with the spark-submit or pyspark command. For example:
spark-submit --jars spark-xml_2.12-0.6.0.jar ...
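If you prefer not to manage the jar file yourself, a minimal sketch (assuming network access to Maven Central) is to let Spark resolve the package at session startup through the spark.jars.packages configuration:

from pyspark.sql import SparkSession

# Ask Spark to fetch spark-xml from Maven when the session starts;
# the coordinates match the Scala 2.12 / 0.6.0 build mentioned above.
spark = SparkSession.builder \
    .appName("spark-xml example") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.6.0") \
    .getOrCreate()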
Error debugging
You may encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
        at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
        at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)
This error occurs because the Scala version of your Spark installation does not match the Scala version that the spark-xml package was built against. For example, spark-xml_2.12-0.6.0.jar is built against Scala 2.12.8. If your Spark distribution uses a different Scala version (for example 2.11), switch to a matching build of the spark-xml package:
spark-submit --jars spark-xml_2.11-0.4.1.jar ...
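To confirm which Scala version your Spark build uses (and therefore which spark-xml artifact to choose), run spark-submit --version, or query the JVM from an active PySpark session. The snippet below is a sketch that relies on py4j's internal _jvm gateway:

# Print the Scala version of the running Spark JVM
# (uses the internal _jvm gateway, so treat this as a debugging aid only)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())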
Read XML file
Remember to change the file path to match your environment.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# The id attribute maps to column _id; child elements map to columns of the same name
schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

# Read the XML file, treating each <record> element as one row
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml", schema=schema)
df.show()
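If you would rather not declare the schema up front, spark-xml can infer it from the data. A minimal sketch:

# Omit the schema argument to let spark-xml infer column types from the file
df_inferred = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml")
df_inferred.printSchema()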
Output
Each XML attribute is converted to a column named _${AttributeName} (prefixed with _), while each child element is converted to a column with the element's name.
+---+---+--------+
|_id|rid|    name|
+---+---+--------+
|  1|  1|Record 1|
|  2|  2|Record 2|
|  3|  3|Record 3|
+---+---+--------+
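If the underscore prefix on the attribute column is undesirable, one option (a sketch using standard DataFrame operations; spark-xml also documents an attributePrefix option to change the prefix itself) is to rename the column after reading:

# Rename the attribute-derived column to drop the underscore prefix
df_renamed = df.withColumnRenamed("_id", "id")
df_renamed.show()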
Write XML file
df.select("rid", "name").write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save('file:///home/tangr/python-examples/test2.xml')
The output path is created as a directory containing one or more partition files (part-*), depending on the parallelism of your Spark session.
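If you want the output directory to contain a single part file, a sketch (reasonable only for small datasets) is to coalesce the DataFrame to one partition before writing:

# Coalesce to one partition so the output directory holds a single part file
df.select("rid", "name").coalesce(1).write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save('file:///home/tangr/python-examples/test2.xml')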
Output
<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
References
If you want to read a single local XML file using Python (without Spark), refer to the following article:
Read and Write XML Files with Python