Convert XML structure into a Pandas DataFrame


Updated June 10th, 2017: Special thanks to all the contributors in the comments section!

TLDR;

import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('/path/user_agents.xml').read()

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = []
    for i, child in enumerate(root):
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
            all_records.append(record)
    return pd.DataFrame(all_records)

Our Goal

Convert any XML file into a pandas dataframe.

I found a lot of examples on the internet of how to convert XML into DataFrames, but each example was very tailored. Our version will take in any XML file and format the headers properly.

For this example, we’re going to convert a User Agent tracker XML feed

Sample Data

Sample XML Data

Before running the sure to download the xml file above.

wget http://www.user-agents.org/allagents.xml

Code Walkthrough

It’s fairly straight forward, so I’ll comment each line to explain.

# BEGIN IMPORTS
import xml.etree.ElementTree as ET
import pandas as pd
# END IMPORTS

xml_data = open('/path/user_agents.xml').read() #Loading the raw XML data

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = [] #This is our record list which we will convert into a dataframe
    for i, child in enumerate(root): #Begin looping through our root tree
        record = {} #Place holder for our record
        for subchild in child: #iterate through the subchildren to user-agent, Ex: ID, String, Description.
            record[subchild.tag] = subchild.text #Extract the text create a new dictionary key, value pair
            all_records.append(record) #Append this record to all_records.
    return pd.DataFrame(all_records) #return records as DataFrame

If all went well you should now have a DataFrame as seen below:

Converted DataFrame

ID String Description Type Comment Link1 Link2
0 id_a_f_3 !Susie (http://www.sync2it.com/susie) Sync2It bookmark management & clustering engine C R None http://www.sync2it.com None
1 id_a_f_6 <a href='http://www.unchaos.com/'> UnChaos </a... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None
2 id_a_f_7 <a href='http://www.unchaos.com/'> UnChaos Bot... UnCHAOS search robot R Site is dead http://www.unchaos.com/ None

Thanks for reading!

If you found this helpful, please feel free to share and comment below.