Read and Write XML Files with Python

XML is a commonly used data exchange format in many applications and systems, although JSON has become more popular. Compared with JSON, XML supports schema (XSD) validation and can be easily transformed into other formats using XSLT. XHTML, a strict XML-based version of HTML, is also used by many websites.

This article provides some examples of using Python to read and write XML files.

Example XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>
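
Before moving on, it can be worth checking that the file you saved is well-formed. The following is a minimal sketch, not part of the original walkthrough; the helper name is_well_formed is hypothetical, and it only checks well-formedness, not validity against a schema.

```python
from xml.dom.minidom import parse
from xml.parsers.expat import ExpatError


def is_well_formed(path):
    """Return True if the file at `path` parses as XML, False otherwise.

    `is_well_formed` is a hypothetical helper name used for illustration.
    """
    try:
        # minidom documents can be used as context managers, which
        # releases the DOM tree when the block exits.
        with parse(path) as doc:
            return doc.documentElement is not None
    except ExpatError:
        # Raised by the underlying expat parser for malformed input.
        return False
```

Calling is_well_formed('test.xml') should return True for the sample above and False if, for example, a closing tag is missing.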

Use XML DOM model

As in many other programming languages, the XML DOM is commonly used to parse and manipulate XML files.

More than a decade ago, when I started coding with C#, XmlDocument was a commonly used class to manipulate XML data. Python also has DOM implementations, such as the minidom package, although it is a minimal implementation of the Document Object Model interface.

Read XML document

The following code snippet reads the attributes and child elements of each record in the document. It first creates a DOM object and then finds all the record elements under the document's root element. Each element is parsed into a dictionary record. When parsing a record element, several object types are used: Attr, Text and Element. All of these classes inherit from the Node class.

from xml.dom.minidom import parse
import os
import pandas as pd

# Folder that contains this script (test.xml is expected next to it)
dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

# The document is released automatically when the with block exits
with parse(f'{dir_path}/test.xml') as xml_doc:
    root = xml_doc.documentElement
    records = root.getElementsByTagName('record')

    for r in records:
        data = {}
        # Attribute value comes from an Attr node
        id_node = r.getAttributeNode('id')
        data['id'] = id_node.value
        # Element text comes from the Text node under each child element
        el_rid = r.getElementsByTagName('rid')[0]
        data['rid'] = el_rid.firstChild.data
        el_name = r.getElementsByTagName('name')[0]
        data['name'] = el_name.firstChild.data
        data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)

Output:

[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
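
To make the relationship between Attr, Text, Element and Node more concrete, here is a small sketch that inspects the objects used in the snippet above. The one-line XML string is a stand-in for test.xml (an assumption made to keep the example self-contained).

```python
from xml.dom.minidom import parseString, Node, Attr, Element, Text

# One-line stand-in for test.xml (an assumption for this sketch).
doc = parseString('<data><record id="1"><rid>1</rid></record></data>')

record = doc.documentElement.getElementsByTagName('record')[0]  # Element
id_attr = record.getAttributeNode('id')                         # Attr
rid_text = record.getElementsByTagName('rid')[0].firstChild     # Text

for node in (record, id_attr, rid_text):
    # Every DOM object is a Node; nodeType tells the concrete kinds apart.
    print(type(node).__name__, isinstance(node, Node), node.nodeType)
```

Note that in an indented file such as test.xml, the whitespace between elements is also represented as Text nodes, so record.firstChild would be a whitespace node rather than the rid element. Looking children up with getElementsByTagName, as the article's code does, avoids that problem.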

Write XML document

We can also use the DOM object to write in-memory XML data to a file.

The following code snippet adds a new attribute to each record element from the previous example and then saves the modified XML document to a new file.

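# Reuses parse and dir_path from the previous snippet
with open(f"{dir_path}/test_new.xml", "w") as new_xml_handle:
    with parse(f'{dir_path}/test.xml') as xml_doc:
        root = xml_doc.documentElement
        records = root.getElementsByTagName('record')

        for r in records:
            # Add (or overwrite) the attribute on each record element
            r.setAttribute('testAttr', 'New Attribute')
        # Serialize the whole document to the open file handle
        xml_doc.writexml(new_xml_handle)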

The new file test_new.xml has the following content after the script is run:

<?xml version="1.0" ?><data>
    <record id="1" testAttr="New Attribute">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

As you can see, the attribute node testAttr is added to each record element.
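
Beyond setting attributes, the same DOM API can create new elements. The sketch below appends a fourth record to a document; the helper name append_record and the one-line sample string are assumptions made for illustration, not part of the original example.

```python
from xml.dom.minidom import parseString


def append_record(xml_doc, rid, name):
    """Append a new <record> element to the document root.

    `append_record` is a hypothetical helper used for illustration.
    """
    root = xml_doc.documentElement
    record = xml_doc.createElement('record')
    record.setAttribute('id', str(rid))
    for tag, value in (('rid', rid), ('name', name)):
        # Each child element wraps a single Text node.
        child = xml_doc.createElement(tag)
        child.appendChild(xml_doc.createTextNode(str(value)))
        record.appendChild(child)
    root.appendChild(record)
    return record


# One-line stand-in for test.xml (an assumption for this sketch).
doc = parseString(
    '<data><record id="1"><rid>1</rid><name>Record 1</name></record></data>'
)
append_record(doc, 4, 'Record 4')
print(doc.documentElement.toxml())
```

New nodes are always created through the owning document (createElement, createTextNode) and only become part of the tree once they are attached with appendChild.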

Use xml.etree.ElementTree

Another approach is to use xml.etree.ElementTree to read and write XML files.

Read XML file using ElementTree

import xml.etree.ElementTree as ET
import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

tree = ET.parse(f'{dir_path}/test.xml')
root = tree.getroot()
# Iterate over the root's children directly;
# Element.getchildren() was removed in Python 3.9
for r in root:
    data = {}
    data['id'] = r.attrib['id']
    el_rid = r.find('rid')
    data['rid'] = el_rid.text
    el_name = r.find('name')
    data['name'] = el_name.text
    data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)

The script above first creates an ElementTree object and then goes through all 'record' elements under the root element. For each 'record' element, it reads the attributes and child elements. The API is similar to minidom's but easier to use.

The output looks like the following:

[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
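
The same result can be produced more compactly with findall and findtext. This is a sketch of an alternative, not the approach used in the script above; the function name records_from_xml is hypothetical, and it takes an XML string rather than a file path so that it is self-contained.

```python
import xml.etree.ElementTree as ET


def records_from_xml(xml_text):
    """Parse an XML string and return one dict per <record> element.

    `records_from_xml` is a hypothetical helper used for illustration.
    """
    root = ET.fromstring(xml_text)
    return [
        {
            'id': r.get('id'),            # attribute value, or None if missing
            'rid': r.findtext('rid'),     # text of the first <rid> child
            'name': r.findtext('name'),   # text of the first <name> child
        }
        for r in root.findall('record')   # direct children named 'record'
    ]
```

Unlike r.attrib['id'] and r.find('rid').text, the get and findtext calls return None instead of raising an exception when an attribute or child element is missing, which may or may not be what you want.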

Write XML file using ElementTree

To write an XML file, we can just call the write function.

Example code:

# Iterate over the root directly; getchildren() was removed in Python 3.9
for r in root:
    r.set('testAttr', 'New Attribute2')
tree.write(f'{dir_path}/test_new_2.xml', xml_declaration=True)

Again, the API is simpler than minidom's. The content of the newly generated file test_new_2.xml looks like the following:

<?xml version='1.0' encoding='us-ascii'?>
<data>
    <record id="1" testAttr="New Attribute2">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute2">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The ElementTree.write function accepts several optional arguments, for example encoding and xml_declaration.
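
To illustrate how the encoding argument affects the output, here is a small sketch that serializes the same tree twice into an in-memory buffer. The helper name to_bytes and the sample document with a non-ASCII character are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET


def to_bytes(tree, **write_kwargs):
    """Serialize an ElementTree into bytes using ElementTree.write.

    `to_bytes` is a hypothetical helper used for illustration.
    """
    buffer = io.BytesIO()
    tree.write(buffer, **write_kwargs)
    return buffer.getvalue()


# Sample document containing a non-ASCII character (e with acute accent).
root = ET.fromstring('<data><name>Caf\u00e9</name></data>')
tree = ET.ElementTree(root)

# Default encoding is US-ASCII: non-ASCII characters become character references.
ascii_bytes = to_bytes(tree)

# With UTF-8, the character is written as-is and declared in the XML declaration.
utf8_bytes = to_bytes(tree, encoding='utf-8', xml_declaration=True)

print(ascii_bytes)
print(utf8_bytes)
```

Passing encoding='utf-8' together with xml_declaration=True is a reasonable choice when the data contains non-ASCII text and the file will be read by other tools.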

Warning: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see XML vulnerabilities.

Summary

There are many other libraries available in Python for parsing and writing XML files. Many of these packages are not as fluent or complete as their Java or .NET equivalents; ElementTree is the closest one I have found so far.
