Read and Write XML Files with Python


XML is a commonly used data exchange format in many applications and systems, although JSON has become more popular. Compared with JSON, XML supports schema (XSD) validation and can be easily transformed into other formats using XSLT. XHTML, a strict XML-based version of HTML, is also used on many websites.

This article provides some examples of using Python to read and write XML files.

Example XML file

Create a sample XML file named test.xml with the following content:

<?xml version="1.0"?>
<data>
    <record id="1">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

Use XML DOM model

As in many other programming languages, the XML DOM is commonly used in Python to parse and manipulate XML files.

More than a decade ago, when I started coding with C#, XmlDocument was a commonly used class for manipulating XML data. Python also has DOM implementations, such as the minidom package, which is a minimal implementation of the Document Object Model interface.

Read XML document

The following code snippet reads the attributes and child elements of each record in the document. It first creates a DOM object and then finds all the record elements under the document root element. Each record element is parsed into a dictionary. Parsing a record element involves several object types: Attr, Text and Element. All of these classes inherit from the Node class.

from xml.dom.minidom import parse
import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

with parse(f'{dir_path}/test.xml') as xml_doc:
    root = xml_doc.documentElement
    records = root.getElementsByTagName('record')

    for r in records:
        data = {}
        id_node = r.getAttributeNode('id')
        data['id'] = id_node.value
        el_rid = r.getElementsByTagName('rid')[0]
        data['rid'] = el_rid.firstChild.data
        el_name = r.getElementsByTagName('name')[0]
        data['name'] = el_name.firstChild.data
        data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)

Output:

[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
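
To make the Attr, Text and Element relationship more concrete, here is a minimal sketch that inspects the node types involved. It parses an inline string (SAMPLE_XML is a hypothetical stand-in for one record of test.xml) so that it runs without the file.

```python
from xml.dom.minidom import parseString, Node, Element, Attr, Text

# Hypothetical inline sample; it mirrors one record of test.xml.
SAMPLE_XML = '<data><record id="1"><rid>1</rid><name>Record 1</name></record></data>'

doc = parseString(SAMPLE_XML)
record = doc.documentElement.getElementsByTagName('record')[0]  # Element
id_attr = record.getAttributeNode('id')                         # Attr
name_text = record.getElementsByTagName('name')[0].firstChild   # Text

# Print the class name, the numeric nodeType and whether the object is a Node.
for label, node in [('record', record), ('id', id_attr), ('name text', name_text)]:
    print(label, type(node).__name__, node.nodeType, isinstance(node, Node))
```

The nodeType values can be compared with constants such as Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE and Node.TEXT_NODE when code needs to branch on the kind of node.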

Write XML document

We can also use the DOM object to write in-memory XML data to a file.

The following code snippet adds a new attribute to each record element from the previous example and then saves the new XML document to a new file.

with open(f"{dir_path}/test_new.xml", "w") as new_xml_handle:
    with parse(f'{dir_path}/test.xml') as xml_doc:
        root = xml_doc.documentElement
        records = root.getElementsByTagName('record')

        for r in records:
            r.setAttribute('testAttr', 'New Attribute')
        xml_doc.writexml(new_xml_handle)

The new file test_new.xml has the following content after the script is run:

<?xml version="1.0" ?><data>
    <record id="1" testAttr="New Attribute">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

As you can see, the attribute testAttr has been added to each record element.
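
The DOM API can also create new elements, not only modify existing ones. The following is a minimal sketch, assuming we want to build a fourth record in memory; the starting document and the values are hypothetical, and the result is serialized to a string instead of a file so the example is self-contained.

```python
from xml.dom.minidom import parseString

# Start from an empty root element (hypothetical starting document).
doc = parseString('<data></data>')
root = doc.documentElement

# Build <record id="4"><rid>4</rid><name>Record 4</name></record>.
record = doc.createElement('record')
record.setAttribute('id', '4')
for tag, value in [('rid', '4'), ('name', 'Record 4')]:
    child = doc.createElement(tag)
    child.appendChild(doc.createTextNode(value))
    record.appendChild(child)
root.appendChild(record)

# toprettyxml returns the indented document as a string.
xml_text = doc.toprettyxml(indent='    ')
print(xml_text)
```

toprettyxml suits a tree built from scratch like this one. For a document parsed from an already indented file, writexml as used above keeps the original whitespace.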

Use xml.etree.ElementTree 

Another approach is to use xml.etree.ElementTree to read and write XML files. 

Read XML file using ElementTree

import xml.etree.ElementTree as ET
import os
import pandas as pd

dir_path = os.path.dirname(os.path.realpath(__file__))

data_records = []

tree = ET.parse(f'{dir_path}/test.xml')
root = tree.getroot()
for r in root.findall('record'):  # getchildren() was removed in Python 3.9
    data = {}
    data['id'] = r.attrib['id']
    el_rid = r.find('rid')
    data['rid'] = el_rid.text
    el_name = r.find('name')
    data['name'] = el_name.text
    data_records.append(data)
print(data_records)
df = pd.DataFrame(data_records)
print(df)

The above script first creates an ElementTree object and then finds all 'record' elements through the root element. For each 'record' element, it reads the attributes and child elements. The API is very similar to minidom's but is easier to use.

The output looks like the following:

[{'id': '1', 'rid': '1', 'name': 'Record 1'}, {'id': '2', 'rid': '2', 'name': 'Record 2'}, {'id': '3', 'rid': '3', 'name': 'Record 3'}]
  id rid      name
0  1   1  Record 1
1  2   2  Record 2
2  3   3  Record 3
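
Real files are not always as regular as test.xml. As a hedged variation on the script above, the sketch below uses Element.get and Element.findtext, which return None or a chosen default instead of raising an error when an attribute or child element is missing. The inline SAMPLE_XML, in which the second record has no name element, is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second record has no <name> child.
SAMPLE_XML = """<data>
    <record id="1"><rid>1</rid><name>Record 1</name></record>
    <record id="2"><rid>2</rid></record>
</data>"""

root = ET.fromstring(SAMPLE_XML)
records = [
    {
        'id': r.get('id'),                       # None if the attribute is absent
        'rid': r.findtext('rid'),                # text of the first <rid> child
        'name': r.findtext('name', default=''),  # '' if there is no <name> child
    }
    for r in root.findall('record')
]
print(records)
```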

Write XML file using ElementTree

To write an XML file, we can simply call the write function.

Example code:

for r in root.findall('record'):  # getchildren() was removed in Python 3.9
    r.set('testAttr', 'New Attribute2')
tree.write(f'{dir_path}/test_new_2.xml', xml_declaration=True)

Again, the API is simpler than minidom's. The content of the newly generated file test_new_2.xml looks like the following:

<?xml version='1.0' encoding='us-ascii'?>
<data>
    <record id="1" testAttr="New Attribute2">
        <rid>1</rid>
        <name>Record 1</name>
    </record>
    <record id="2" testAttr="New Attribute2">
        <rid>2</rid>
        <name>Record 2</name>
    </record>
    <record id="3" testAttr="New Attribute2">
        <rid>3</rid>
        <name>Record 3</name>
    </record>
</data>

The ElementTree.write function accepts several optional arguments, for example, encoding and xml_declaration.
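
As a small illustration of the encoding argument, the sketch below writes the same tree twice into in-memory buffers: once with the default encoding (us-ascii) and once with utf-8. The sample document with a non-ASCII name is a hypothetical example, and io.BytesIO is used in place of a file path only to keep the example self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample containing a non-ASCII character.
root = ET.fromstring('<data><record id="1"><name>Café</name></record></data>')
tree = ET.ElementTree(root)

# Default encoding is us-ascii: non-ASCII characters become character references.
ascii_buffer = io.BytesIO()
tree.write(ascii_buffer, xml_declaration=True)

# With utf-8 the character is written as UTF-8 bytes.
utf8_buffer = io.BytesIO()
tree.write(utf8_buffer, encoding='utf-8', xml_declaration=True)

ascii_bytes = ascii_buffer.getvalue()
utf8_bytes = utf8_buffer.getvalue()
print(ascii_bytes.decode('ascii'))
print(utf8_bytes.decode('utf-8'))
```

Both outputs parse back to the same data. The difference is only in how the non-ASCII character is represented in the written bytes and in the encoding named in the XML declaration.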

Warning: The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see XML vulnerabilities.

Summary

There are many other libraries available in Python for parsing and writing XML files. Many of these packages are not as fluent or complete as their Java or .NET equivalents. ElementTree is the closest one I have found so far.

References

Read XML Files as Pandas DataFrame
