Convert XML structure into a Pandas DataFrame


Tested with Python 3 and updated December 16, 2019: Special thanks to all the contributors in the comments section!

TLDR;

Gather XML Data

import requests

user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content

Parse XML Data


import xml.etree.ElementTree as ET
import pandas as pd
class XML2DataFrame:

    def __init__(self, xml_data):
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        """Return a list of dictionaries from the text and attributes of the
        children under this XML root."""
        return [parse_element(child) for child in root.getchildren()]

    def parse_element(self, element, parsed=None):
        """ Collect {key:attribute} and {tag:text} from thie XML
         element and all its children into a single dictionary of strings."""
        if parsed is None:
            parsed = dict()

        for key in element.keys():
            if key not in parsed:
                parsed[key] = element.attrib.get(key)
            if element.text:
                parsed[element.tag] = element.text                
            else:
                raise ValueError('duplicate attribute {0} at element {1}'.format(key, element.getroottree().getpath(element)))           

        """ Apply recursion"""
        for child in list(element):
            self.parse_element(child, parsed)
        return parsed

    def process_data(self):
        """ Initiate the root XML, parse it, and return a dataframe"""
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

Our Goal

Convert XML file into a pandas dataframe.

I found a lot of examples on the internet of how to convert XML into DataFrames, but each example was very tailored. Our version will take in most XML data and format the headers properly. Some customization may be required depending on your data structure.

For this example, we’re going to convert a User Agent tracker XML feed

Sample Data

Sample XML Data

Before running the sure to download the xml file above.

import requests

user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content

Code Walkthrough

It’s fairly straight forward, so I’ll comment each line to explain.

Parse XML Data


import xml.etree.ElementTree as ET
import pandas as pd
class XML2DataFrame:

    def __init__(self, xml_data):
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        """Return a list of dictionaries from the text and attributes of the
        children under this XML root."""
        return [parse_element(child) for child in root.getchildren()]

    def parse_element(self, element, parsed=None):
        """ Collect {key:attribute} and {tag:text} from thie XML
         element and all its children into a single dictionary of strings."""
        if parsed is None:
            parsed = dict()

        for key in element.keys():
            if key not in parsed:
                parsed[key] = element.attrib.get(key)
            if element.text:
                parsed[element.tag] = element.text                
            else:
                raise ValueError('duplicate attribute {0} at element {1}'.format(key, element.getroottree().getpath(element)))           

        """ Apply recursion"""
        for child in list(element):
            self.parse_element(child, parsed)
        return parsed

    def process_data(self):
        """ Initiate the root XML, parse it, and return a dataframe"""
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

If all went well you should now have a DataFrame as seen below:

Converted DataFrame

ID String Description Type Comment Link1 Link2
0 id_a_f_3 !Susie (http://www.sync2it.com/susie) Sync2It bookmark management & clustering engine C R None http://www.sync2it.com None
1 id_a_f_6 <a href='http://www.unchaos.com/'> UnChaos </a... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None
2 id_a_f_7 <a href='http://www.unchaos.com/'> UnChaos Bot... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None

Thanks for reading!

If you found this helpful, please feel free to share and comment below.