Convert XML structure into a Pandas DataFrame


Updated January 14, 2018: Special thanks to all the contributors in the comments section!

TLDR;

Gather XML Data

import requests

user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content

Parse XML Data


import xml.etree.ElementTree as ET
import pandas as pd

class XML2DataFrame:

def __init__(self, xml_data):
self.root = ET.XML(xml_data)

def parse_root(self, root):
return [parse_element(child) for child in iter(root)]

def parse_element(self, element, parsed=None):
if parsed is None:
parsed = dict()
for key in element.keys():
parsed[key] = element.attrib.get(key)
if element.text:
parsed[element.tag] = element.text
for child in list(element):
parse_element(child, parsed)
return parsed

def process_data(self):
structure_data = self.parse_root(self.root)
return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

Our Goal

Convert XML file into a pandas dataframe.

I found a lot of examples on the internet of how to convert XML into DataFrames, but each example was very tailored. Our version will take in most XML data and format the headers properly. Some customization may be required depending on your data structure.

For this example, we’re going to convert a User Agent tracker XML feed

Sample Data

Sample XML Data

Before running the sure to download the xml file above.

import requests

user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content

Code Walkthrough

It’s fairly straight forward, so I’ll comment each line to explain.

Parse XML Data


import xml.etree.ElementTree as ET
import pandas as pd

class XML2DataFrame:

def __init__(self, xml_data):
self.root = ET.XML(xml_data)

def parse_root(self, root):
"""Return a list of dictionaries from the text
and attributes of the children under this XML root."""

return [parse_element(child) for child in iter(root)]

def parse_element(self, element, parsed=None):
""" Collect {key:attribute} and {tag:text} from thie XML
element and all its children into a single dictionary of strings."""

if parsed is None:
parsed = dict()

for key in element.keys():
parsed[key] = element.attrib.get(key)

if element.text:
parsed[element.tag] = element.text

""" Apply recursion"""
for child in list(element):
parse_element(child, parsed)

return parsed

def process_data(self):
""" Initiate the root XML, parse it, and return a dataframe"""
structure_data = self.parse_root(self.root)
return pd.DataFrame(structure_data)

xml2df = XML2DataFrame(xml_data)
xml_dataframe = xml2df.process_data()

If all went well you should now have a DataFrame as seen below:

Converted DataFrame

ID String Description Type Comment Link1 Link2
0 id_a_f_3 !Susie (http://www.sync2it.com/susie) Sync2It bookmark management & clustering engine C R None http://www.sync2it.com None
1 id_a_f_6 <a href='http://www.unchaos.com/'> UnChaos </a... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None
2 id_a_f_7 <a href='http://www.unchaos.com/'> UnChaos Bot... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None

Thanks for reading!

If you found this helpful, please feel free to share and comment below.