Read and Write XML Files with Python
XML is a commonly used data exchange format in many applications and systems though JSON became more popular nowadays. Compared with JSON, XML supports schema (XSD) validation and can be easily transformed other formats using XSLT. XHTML is also a strict version of HTML that is XML based and used in many websites.
This article provides some examples of using Python to read and write XML files.
Example XML file
Create a sample XML file named test.xml with the following content:
<?xml version="1.0"?>
<data>
<record id="1">
<rid>1</rid>
<name>Record 1</name>
</record>
<record id="2">
<rid>2</rid>
<name>Record 2</name>
</record>
<record id="3">
<rid>3</rid>
<name>Record 3</name>
</record>
</data>
Use XML DOM model
As many other programming languages, XML DOM is commonly used to parse and to manipulate XML files.
More than decades ago when I started coding with C#, XmlDocument was a commonly used class to manipulate XML data. In Python, there are also DOM model implementations like package minidom even though it is a minimal implementation of the Document Object Model interface.
Read XML document
The following code snippet reads the attributes from the document. It first creates a DOM object and then finds all the record elements from document root element. For each element, is parsed as a dictionary record. When parsing record element, several objects are used: Attr, Text and Element. All these elements are inherited from Node class.
from xml.dom.minidom import parse import os import pandas as pd dir_path = os.path.dirname(os.path.realpath(__file__)) data_records = [] with parse(f'{dir_path}/test.xml') as xml_doc: root = xml_doc.documentElement records = root.getElementsByTagName('record') for r in records: data = {} id_node = r.getAttributeNode('id') data['id'] = id_node.value el_rid = r.getElementsByTagName('rid')[0] data['rid'] = el_rid.firstChild.data el_name = r.getElementsByTagName('name')[0] data['name'] = el_name.firstChild.data data_records.append(data) print(data_records) df = pd.DataFrame(data_records) print(df)
[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}] id rid name 0 1 1 Record 1 1 2 2 Record 2 2 3 3 Record 3
Write XML document
We can also use DOM object to write XML data in memory to files.
The following code snippet adds a new attribute for each record element in the previous example and then save the new XML document to a new file.
with open(f"{dir_path}/test_new.xml", "w") as new_xml_handle: with parse(f'{dir_path}/test.xml') as xml_doc: root = xml_doc.documentElement records = root.getElementsByTagName('record') for r in records: r.setAttribute('testAttr', 'New Attribute') xml_doc.writexml(new_xml_handle)
<?xml version="1.0" ?><data> <record id="1" testAttr="New Attribute"> <rid>1</rid> <name>Record 1</name> </record> <record id="2" testAttr="New Attribute"> <rid>2</rid> <name>Record 2</name> </record> <record id="3" testAttr="New Attribute"> <rid>3</rid> <name>Record 3</name> </record> </data>
As you can see, attribute node testAttr is added for each element.
Use xml.etree.ElementTree
Another approach is to use xml.etree.ElementTree to read and write XML files.
Read XML file using ElementTree
import xml.etree.ElementTree as ET import os import pandas as pd dir_path = os.path.dirname(os.path.realpath(__file__)) data_records = [] tree = ET.parse(f'{dir_path}/test.xml') root = tree.getroot() for r in root.getchildren(): data = {} data['id'] = r.attrib['id'] el_rid = r.find('rid') data['rid'] = el_rid.text el_name = r.find('name') data['name'] = el_name.text data_records.append(data) print(data_records) df = pd.DataFrame(data_records) print(df)
The above scripts first create ElementTree object and then find all 'record' elements through the root element. For each 'record' element, it parses the attributes and child elements. The APIs are very similar to the minidom one but is easier to use.
The output looks like the following:
[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}] id rid name 0 1 1 Record 1 1 2 2 Record 2 2 3 3 Record 3
Write XML file using ElementTree
To write XML file we can just call the write function.
Example code:
for r in root.getchildren(): r.set('testAttr', 'New Attribute2') tree.write(f'{dir_path}/test_new_2.xml', xml_declaration=True)
Again, the API is simpler compared with minidom. The content of the newly generated file text_new_2.xml is like the following:
<?xml version='1.0' encoding='us-ascii'?> <data> <record id="1" testAttr="New Attribute2"> <rid>1</rid> <name>Record 1</name> </record> <record id="2" testAttr="New Attribute2"> <rid>2</rid> <name>Record 2</name> </record> <record id="3" testAttr="New Attribute2"> <rid>3</rid> <name>Record 3</name> </record> </data>
For ElementTree.write function, you can specify many optional arguments, for example, encoding, XML declaration, etc.
Summary
There are many other libraries available in Python to allow you to parse and write XML files. For many of these packages, they are not as fluent or complete as Java or .NET equivalent libraries. ElementTree is the closest one I found so far.
References
Read XML Files as Pandas DataFrame