Convert XML structure into a Pandas DataFrame


TLDR;

import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd

xml_data = '/path/user_agents.xml'

def xml2df(xml_data):
    tree = ET.parse(xml_data)
    root = tree.getroot()
    all_records = []
    headers = []
    for i, child in enumerate(root):
        record = []
        for subchild in child:
            record.append(subchild.text)
            if subchild.tag not in headers:
                headers.append(subchild.tag)
        all_records.append(record)
    return pd.DataFrame(all_records, columns=headers)

Our Goal

Convert any XML file into a pandas dataframe.

I found a lot of examples on the internet of how to convert XML into DataFrames, but each example was very tailored. Our version will take in any XML file and format the headers properly.

For this example, we’re going to convert a User Agent tracker XML feed

Sample Data

Sample XML Data

Before running the sure to download the xml file above.

wget http://www.user-agents.org/allagents.xml

Code Walkthrough

It’s fairly straight forward, so I’ll comment each line to explain.

# BEGIN IMPORTS
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd
# END IMPORTS

xml_data = '/path/user_agents.xml' #Use the path where the xml data is located.

def xml2df(xml_data):
    tree = ET.parse(xml_data) #Initiates the tree Ex: <user-agents>
    root = tree.getroot() #Starts the root of the tree Ex: <user-agent>
    all_records = [] #This is our record list which we will convert into a dataframe
    headers = [] #Subchildren tags will be parsed and appended here
    for i, child in enumerate(root): #Begin looping through our root tree
        record = [] #Place holder for our record
        for subchild in child: #iterate through the subchildren to user-agent, Ex: ID, String, Description.
            record.append(subchild.text) #Extract the text and append it to our record list
            if subchild.tag not in headers: #Check the header list to see if the subchild tag <ID>, <String>... is in our headers field. If not append it. This will be used for our headers.
                headers.append(subchild.tag)
        all_records.append(record) #Append this record to all_records.
    return pd.DataFrame(all_records, columns=headers) #Finally, return our Pandas dataframe with headers in the column.

If all went well you should now have a DataFrame as seen below:

Converted DataFrame

ID String Description Type Comment Link1 Link2
0 id_a_f_3 !Susie (http://www.sync2it.com/susie) Sync2It bookmark management & clustering engine C R None http://www.sync2it.com None
1 id_a_f_6 <a href='http://www.unchaos.com/'> UnChaos </a... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None
2 id_a_f_7 <a href='http://www.unchaos.com/'> UnChaos Bot... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None

Thanks for reading!

If you found this helpful, please feel free to share and comment below.