Read and Write XML Files with Python


XML is a commonly used data exchange format in many applications and systems, although JSON has become more popular in recent years. Compared with JSON, XML supports schema (XSD) validation and can be easily transformed into other formats using XSLT. XHTML, a strict XML-based version of HTML, is also used on many websites.

This article provides some examples of using Python to read and write XML files.

Example XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

Use XML DOM model

As in many other programming languages, the XML DOM is commonly used in Python to parse and manipulate XML files.

More than a decade ago, when I started coding in C#, XmlDocument was a commonly used class for manipulating XML data. Python also has DOM implementations, such as the minidom module (xml.dom.minidom), which is a minimal implementation of the Document Object Model interface.

Read XML document

The following code snippet reads the attributes and child elements of each record. It first creates a DOM object and then finds all the record elements under the document root element. Each element is parsed into a dictionary record. Parsing a record element involves several object types: Attr, Text and Element. All of these classes inherit from the Node class.

from xml.dom.minidom import parse
import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

with parse(f'{dir_path}/test.xml') as xml_doc:
    root = xml_doc.documentElement
    records = root.getElementsByTagName('record')

    for r in records:
        data = {}
        id_node = r.getAttributeNode('id')
        data['id'] = id_node.value
        el_rid = r.getElementsByTagName('rid')[0]
        data['rid'] = el_rid.firstChild.data
        el_name = r.getElementsByTagName('name')[0]
        data['name'] = el_name.firstChild.data
        data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)
Output:
[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
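
To make the node types mentioned above more concrete, here is a minimal sketch that inspects them directly. It assumes an inline copy of one record (the XML_TEXT constant is introduced only for this illustration) instead of reading test.xml, so it can run on its own.

```python
from xml.dom.minidom import parseString, Node

# Assumption: an inline copy of one record from test.xml, used only for illustration.
XML_TEXT = '<data><record id="1"><rid>1</rid><name>Record 1</name></record></data>'

with parseString(XML_TEXT) as doc:
    record = doc.documentElement.getElementsByTagName('record')[0]

    id_attr = record.getAttributeNode('id')         # an Attr node
    rid_el = record.getElementsByTagName('rid')[0]  # an Element node
    rid_text = rid_el.firstChild                    # a Text node

    # Record each object's class name and whether it is a Node subclass.
    summary = [(type(n).__name__, isinstance(n, Node))
               for n in (id_attr, rid_el, rid_text)]
    values = (id_attr.value, rid_text.data)

print(summary)
print(values)
```

Note that firstChild is None for an empty element such as <name/>, so code that handles less regular input may want to check for that before reading .data.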

Write XML document

We can also use the DOM object to write in-memory XML data to a file.

The following code snippet adds a new attribute to each record element from the previous example and then saves the modified XML document to a new file.

with open(f"{dir_path}/test_new.xml", "w") as new_xml_handle:
    with parse(f'{dir_path}/test.xml') as xml_doc:
        root = xml_doc.documentElement
        records = root.getElementsByTagName('record')

        for r in records:
            r.setAttribute('testAttr', 'New Attribute')
        xml_doc.writexml(new_xml_handle)
The new file test_new.xml has the following content after the script is run:
<?xml version="1.0" ?><data>
    <record id="1" testAttr="New Attribute">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

As you can see, the attribute node testAttr is added to each record element.
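
The same write step can be tried without touching the file system. The following is a rough sketch, assuming an inline XML string (XML_TEXT, made up for this example) and an in-memory io.StringIO buffer in place of the file handle. It also passes the optional encoding argument of writexml, which adds an encoding to the XML declaration; the actual bytes written still depend on how the target file is opened.

```python
import io
from xml.dom.minidom import parseString

# Assumption: a shortened inline version of test.xml, used only for illustration.
XML_TEXT = ('<data><record id="1"><rid>1</rid></record>'
            '<record id="2"><rid>2</rid></record></data>')

buffer = io.StringIO()  # stands in for the file handle returned by open(..., "w")
with parseString(XML_TEXT) as doc:
    for r in doc.documentElement.getElementsByTagName('record'):
        r.setAttribute('testAttr', 'New Attribute')
    # encoding only changes the declaration text; it does not encode the output.
    doc.writexml(buffer, encoding='utf-8')

result = buffer.getvalue()
print(result)
```

When writing to a real file, opening it with a matching encoding (for example open(path, "w", encoding="utf-8")) keeps the declaration and the file content consistent.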

Use xml.etree.ElementTree 

Another approach is to use xml.etree.ElementTree to read and write XML files. 

Read XML file using ElementTree

import xml.etree.ElementTree as ET
import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

tree = ET.parse(f'{dir_path}/test.xml')
root = tree.getroot()
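# Element.getchildren() was removed in Python 3.9; findall() returns the matching direct children.
for r in root.findall('record'):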
    data = {}
    data['id'] = r.attrib['id']
    el_rid = r.find('rid')
    data['rid'] = el_rid.text
    el_name = r.find('name')
    data['name'] = el_name.text
    data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)

The above script first creates an ElementTree object and then finds all 'record' elements through the root element. For each 'record' element, it reads the attributes and child elements. The API is very similar to the minidom one but is easier to use.

The output looks like the following:

[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
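
For a more compact variant of the same read logic, the sketch below uses Element.get and Element.findtext inside a list comprehension. It assumes an inline XML string (XML_TEXT, introduced only for this example) parsed with ET.fromstring rather than the test.xml file, and it leaves out the pandas step.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened inline version of test.xml, used only for illustration.
XML_TEXT = """<data>
    <record id="1"><rid>1</rid><name>Record 1</name></record>
    <record id="2"><rid>2</rid><name>Record 2</name></record>
</data>"""

root = ET.fromstring(XML_TEXT)

# get() reads an attribute; findtext() returns the text of the first matching child.
data_records = [
    {
        'id': r.get('id'),
        'rid': r.findtext('rid'),
        'name': r.findtext('name'),
    }
    for r in root.findall('record')
]
print(data_records)
```

Both get and findtext return None when the attribute or child element is missing, whereas r.attrib['id'] raises a KeyError, which may or may not be the behavior you want.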

Write XML file using ElementTree

To write an XML file, we can simply call the write function of the ElementTree object.

Example code:

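# Element.getchildren() was removed in Python 3.9; findall() returns the matching direct children.
for r in root.findall('record'):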
    r.set('testAttr', 'New Attribute2')
tree.write(f'{dir_path}/test_new_2.xml', xml_declaration=True)

Again, the API is simpler than the minidom one. The newly generated file test_new_2.xml has the following content:

<?xml version='1.0' encoding='us-ascii'?>
<data>
    <record id="1" testAttr="New Attribute2">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute2">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The ElementTree.write function accepts several optional arguments, for example encoding, xml_declaration, default_namespace, method and short_empty_elements.
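
As an illustration of these optional arguments, here is a small sketch that writes with an explicit UTF-8 encoding. It assumes an inline XML string and an in-memory io.BytesIO buffer (both chosen only for this example) instead of test.xml and a file path.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a one-record inline document, used only for illustration.
root = ET.fromstring('<data><record id="1"><name>Record 1</name></record></data>')
for r in root.findall('record'):
    r.set('testAttr', 'New Attribute2')

tree = ET.ElementTree(root)
buffer = io.BytesIO()  # stands in for a file path or a file opened in binary mode
# With encoding='utf-8', the declaration names UTF-8 instead of the default us-ascii.
tree.write(buffer, encoding='utf-8', xml_declaration=True)

xml_bytes = buffer.getvalue()
print(xml_bytes.decode('utf-8'))
```

With the default us-ascii encoding, non-ASCII characters are written as numeric character references, so an explicit encoding such as UTF-8 is usually the more readable choice for text containing such characters.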

Warning: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the XML vulnerabilities section of the Python documentation.

Summary

Many other Python libraries can parse and write XML files. Many of these packages are not as fluent or complete as their Java or .NET equivalents; ElementTree is the closest one I have found so far.

References

Read XML Files as Pandas DataFrame
