This article shows you how to read and write XML files in Spark.

Sample XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

Dependent library

For more information, refer to the following repository on GitHub:

spark-xml

You can download the package directly from the Maven repository: https://mvnrepository.com/artifact/com.databricks/spark-xml.

Make sure this package is available in your Spark environment. Alternatively, you can pass the package as a parameter when running your Spark job with the spark-submit or pyspark command. For example:

spark-submit --jars spark-xml_2.12-0.6.0.jar ...
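Another option is to let Spark resolve the package from Maven through the spark.jars.packages configuration. The following is a minimal sketch, assuming the com.databricks:spark-xml_2.12:0.6.0 coordinate and network access to the Maven repository; pick the coordinate that matches the Scala version of your Spark build.

from pyspark.sql import SparkSession

# Sketch: ask Spark to resolve spark-xml from Maven at startup.
# The coordinate below is an example; it must match your Spark build's Scala version.
spark = SparkSession.builder \
    .appName("spark-xml via spark.jars.packages") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.6.0") \
    .getOrCreate()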

Error debug

You may encounter the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp
        at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:36)
        at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:65)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
        at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:29)

This occurs because the Scala version of your Spark distribution does not match the Scala version that the spark-xml artifact was built for.

For example, spark-xml_2.12-0.6.0.jar is built against Scala 2.12.8, so it fails on a Spark distribution built with Scala 2.11. In that case, switch to a spark-xml artifact that matches your Spark distribution's Scala version:

spark-submit --jars spark-xml_2.11-0.4.1.jar ...

Read XML file

Remember to change the file path to match your environment.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "Python Example - PySpark Read XML"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Explicit schema: the id attribute maps to column _id, while the
# rid and name child elements map to columns with the same names.
schema = StructType([
    StructField('_id', IntegerType(), False),
    StructField('rid', IntegerType(), False),
    StructField('name', StringType(), False)
])

# Read the XML file, treating each <record> element as one row
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml", schema=schema)

df.show()

Output

Each XML attribute is converted to a column named _${AttributeName} (i.e. prefixed with _), while each child element is converted to a column with the same name as the element.

+---+---+--------+
|_id|rid|    name|
+---+---+--------+
|  1|  1|Record 1|
|  2|  2|Record 2|
|  3|  3|Record 3|
+---+---+--------+
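If you do not pass an explicit schema, spark-xml can also infer one from the data. A minimal sketch, using the same example path as above:

# Sketch: read without an explicit schema and let spark-xml infer it
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("file:///home/tangr/python-examples/test.xml")

# Inspect the inferred column names and types
df.printSchema()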

Write XML file

df.select("rid","name").write.format("com.databricks.spark.xml").option("rootTag", "data").option("rowTag", "record").mode(
    "overwrite").save('file:///home/tangr/python-examples/test2.xml')

The output is written as partitioned part files; how many files you get depends on the parallelism configured in your Spark session (see the sketch after the output below for writing a single file).

Output

<data>
    <record>
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record>
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record>
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
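If you prefer a single output file, one option is to reduce the DataFrame to one partition before writing. A minimal sketch, reusing the same example output path:

# Sketch: coalesce to one partition so the output folder contains a single part file
df.select("rid", "name") \
    .coalesce(1) \
    .write.format("com.databricks.spark.xml") \
    .option("rootTag", "data") \
    .option("rowTag", "record") \
    .mode("overwrite") \
    .save("file:///home/tangr/python-examples/test2.xml")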