Parse XML to a pandas DataFrame in Python

北恋 2021-01-21 00:27

I am trying to read an XML file and convert it to a pandas DataFrame. However, the result comes back empty.

This is a sample of the XML structure:


         


        
2 answers
  •  滥情空心
    2021-01-21 00:43

    The problem with your solution is that the element data was not extracted properly. The XML in your question is nested several layers deep, so the data has to be read and extracted recursively. The following solution should give you what you need in this case. I would still encourage you to look at this article and the Python documentation for more clarity.
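    To make the idea of recursive extraction concrete, here is a minimal sketch using only the standard library. The sample XML is hypothetical: the tag and attribute names are borrowed from the column list used further down, but the nesting is an assumption. Unlike the full function in Method 1, this sketch skips empty text and has no notion of default columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the nesting shown here is an assumption.
SAMPLE_XML = """
<Instances>
  <Instance ID="1">
    <MetaInfo TaskID="T1" DataSource="demo"/>
    <Question>What is 2 + 2?</Question>
    <Annotation Label="correct">
      <Comments Watch="1">Looks fine</Comments>
    </Annotation>
  </Instance>
</Instances>
"""


def flatten(element, row=None):
    """Collect attributes and text of `element` and all its descendants."""
    if row is None:
        row = {}
    row.update(element.attrib)           # attributes, e.g. ID="1"
    text = (element.text or "").strip()  # text directly inside the tag
    if text:
        row[element.tag] = text
    for child in element:                # iterating yields child elements
        flatten(child, row)
    return row


root = ET.fromstring(SAMPLE_XML)
rows = [flatten(instance) for instance in root]
print(rows)
```

    Each top-level child of the root becomes one flat dictionary, which is the shape pandas needs for one row.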

    Method: 1

    import pandas as pd
    import xml.etree.ElementTree as ET
    
    def xml2df(xml_source, df_cols, source_is_file = False, show_progress=True): 
        """Parse the input XML source and store the result in a pandas 
        DataFrame with the given columns. 
    
        For xml_source = xml_file, Set: source_is_file = True
        For xml_source = xml_string, Set: source_is_file = False
    
        
        Example structure (tag names are placeholders):

        <element>
            <child1>Child 1 Text</child1>
            <child2>Child 2 Text</child2>
            <child3>Child 3 Text</child3>
        </element>

        For an XML structure like the one above, the child elements of a tag
        are returned by list(element) and its attributes by element.items().
        Any text associated with a tag is available as element.text, and the
        name of the tag itself as element.tag.
        """
        if source_is_file:
            xtree = ET.parse(xml_source) # xml_source = xml_file
            xroot = xtree.getroot()
        else:
            xroot = ET.fromstring(xml_source) # xml_source = xml_string
        consolidator_dict = dict()
        default_instance_dict = {label: None for label in df_cols}
    
        def get_children_info(children, instance_dict):
            # We avoid element.getchildren(): it was deprecated and has been
            # removed in Python 3.9. list(element) returns the child elements.
            for child in children:
                #print(child)
                #print(child.tag)
                #print(child.items())
                #print(child.getchildren()) # deprecated method
                #print(list(child))
                if len(list(child))>0:
                    instance_dict = get_children_info(list(child), 
                                                      instance_dict)
    
                if len(list(child.keys()))>0:
                    items = child.items()
                    instance_dict.update({key: value for (key, value) in items})             
    
                #print(child.keys())
                instance_dict.update({child.tag: child.text})
            return instance_dict
    
        # Loop over all instances
        for instance in list(xroot):
            instance_dict = default_instance_dict.copy()           
            # The first attribute is "ID". next(iter(...)) avoids indexing a dict view.
            ikey, ivalue = next(iter(instance.attrib.items()))
            instance_dict.update({ikey: ivalue}) 
            if show_progress:
                print('{}: {}={}'.format(instance.tag, ikey, ivalue))
            # Loop inside every instance
            instance_dict = get_children_info(list(instance), 
                                              instance_dict)   
    
            #consolidator_dict.update({ivalue: instance_dict.copy()}) 
            consolidator_dict[ivalue] = instance_dict.copy()       
        df = pd.DataFrame(consolidator_dict).T 
        df = df[df_cols]
    
        return df
    

    Run the following to generate the desired output.

    xml_source = r'grade_data.xml'
    df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
               "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']
    
    df = xml2df(xml_source, df_cols, source_is_file = True)
    df
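    The last two steps of xml2df (building a frame from a dict of dicts, transposing it, then selecting df_cols) can also be written with `pd.DataFrame.from_dict(..., orient="index")` and `reindex`. The sketch below uses made-up rows and a shortened column list, purely for illustration. `reindex` is a swapped-in alternative to `df[df_cols]`: it fills a missing column with NaN instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical parsed rows, keyed by instance ID (values are made up).
parsed = {
    "1": {"ID": "1", "Question": "What is 2 + 2?", "Answer": "4"},
    "2": {"ID": "2", "Question": "Capital of France?"},  # no "Answer" key
}
df_cols = ["ID", "Question", "Answer"]

# orient="index" makes each outer key a row, like pd.DataFrame(parsed).T
df = pd.DataFrame.from_dict(parsed, orient="index")

# reindex adds any missing column as NaN instead of raising KeyError
df = df.reindex(columns=df_cols)
print(df)
```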
    

    Method: 2

    Given that you have the xml_string, you could convert XML >> dict >> DataFrame. Run the following to get the desired output.

    Note: You will need to install xmltodict to use Method-2. This method is inspired by the solution suggested by @martin-blech at How to convert XML to JSON in Python? [duplicate] . Kudos to @martin-blech for making it.

    pip install -U xmltodict
    

    Solution

    def read_recursively(x, instance_dict):
        # Note: relies on the module-level list df_cols (defined in Method 1).
        txt = ''
        for key in x.keys():
            k = key.replace("@","")
            if k in df_cols: 
                if isinstance(x.get(key), dict):
                    instance_dict, txt = read_recursively(x.get(key), instance_dict)
                #else:                
                instance_dict.update({k: x.get(key)})
                #print('{}: {}'.format(k, x.get(key)))
            else:
                #print('else: {}: {}'.format(k, x.get(key)))
                # dig deeper if value is another dict
                if isinstance(x.get(key), dict):
                    instance_dict, txt = read_recursively(x.get(key), instance_dict)                
                # add simple text associated with element
                if k=='#text':
                    txt = x.get(key)
            # update text to corresponding parent element    
            if (k!='#text') and (txt!=''):
                instance_dict.update({k: txt})
        return (instance_dict, txt)
    

    You will need the function read_recursively() given above. Now run the following.

    import json
    import pandas as pd
    import xmltodict
    
    o = xmltodict.parse(xml_string) # INPUT: XML_STRING
    #print(json.dumps(o)) # uncomment to see xml to json converted string
    
    consolidated_dict = dict()
    oi = o['Instances']['Instance']
    
    for x in oi:
        instance_dict = dict()
        instance_dict, _ = read_recursively(x, instance_dict)
        consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
    df = pd.DataFrame(consolidated_dict).T
    df = df[df_cols]
    df
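    One caveat with Method 2: xmltodict stores attributes under keys prefixed with "@" and element text under "#text", and a tag that occurs only once is returned as a single dict rather than a list. With a single Instance, the loop `for x in oi` would therefore iterate over keys instead of instances. A small normalising helper is sketched below. The name `as_list` is hypothetical, and the dicts only imitate the shape of xmltodict output so that the snippet runs without the library. Passing `force_list=('Instance',)` to `xmltodict.parse` is another way to get the same effect.

```python
def as_list(value):
    """Return `value` as a list: [] for None, [value] for a single item."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Dicts shaped like xmltodict output; the values are hypothetical.
one_instance = {"Instances": {"Instance": {"@ID": "1", "Question": "Q1"}}}
two_instances = {
    "Instances": {
        "Instance": [
            {"@ID": "1", "Question": "Q1"},
            {"@ID": "2", "Question": {"@Lang": "en", "#text": "Q2"}},
        ]
    }
}

# The same loop now works whether there is one instance or several.
ids_one = [inst["@ID"] for inst in as_list(one_instance["Instances"]["Instance"])]
ids_two = [inst["@ID"] for inst in as_list(two_instances["Instances"]["Instance"])]
print(ids_one, ids_two)
```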
    
