How to convert an XML file to a nice pandas DataFrame?

一向 2020-11-22 16:30

Let's assume that I have an XML like this:



        
4 Answers
  • 2020-11-22 17:06

    You can also convert by creating a dictionary of elements and then directly converting to a data frame:

    import xml.etree.ElementTree as ET
    import pandas as pd
    
    # Contents of test.xml
    # <?xml version="1.0" encoding="utf-8"?>
    # <tags>
    #   <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
    #   <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
    #   <row Id="3" TagName="elicitation" Count="10" />
    #   <row Id="5" TagName="open-source" Count="16" />
    # </tags>
    
    root = ET.parse('test.xml').getroot()
    
    tags = {"tags":[]}
    for elem in root:
        tag = {}
        tag["Id"] = elem.attrib['Id']
        tag["TagName"] = elem.attrib['TagName']
        tag["Count"] = elem.attrib['Count']
        tags["tags"].append(tag)
    
    df_users = pd.DataFrame(tags["tags"])
    df_users.head()
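
The loop above copies three named attributes by hand. As a minimal sketch (not part of the original answer), each element's whole `attrib` mapping can be turned into a dict instead, so optional attributes such as ExcerptPostId are kept and pandas fills the gaps with NaN. The XML is inlined here so the snippet runs on its own, and `attributes_to_frame` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Same rows as test.xml above, inlined so the sketch is self-contained.
XML_TEXT = """<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>"""


def attributes_to_frame(xml_text, tag="row"):
    """Hypothetical helper: one row per <tag> element, one column per attribute."""
    root = ET.fromstring(xml_text)
    records = [dict(elem.attrib) for elem in root.iter(tag)]
    return pd.DataFrame(records)


df_tags = attributes_to_frame(XML_TEXT)

# XML attribute values are always strings, so convert numeric columns explicitly.
numeric_cols = ["Id", "Count", "ExcerptPostId", "WikiPostId"]
df_tags[numeric_cols] = df_tags[numeric_cols].apply(pd.to_numeric)
```

Rows that lack an attribute end up with NaN in that column, which also forces the column to a float dtype after conversion.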
    
  • 2020-11-22 17:06

    Chiming in to recommend the xmltodict library. It handled your XML text pretty well, and I've used it to ingest an XML file with almost a million records.
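
The answer gives no code, so here is a rough sketch of what this could look like. It assumes the third-party xmltodict package is installed and uses the tag rows from the previous answer as a stand-in for the question's XML. `attr_prefix=""` drops the "@" that xmltodict puts in front of attribute names by default, and `force_list=("row",)` keeps the result a list even when the file holds a single row.

```python
import pandas as pd
import xmltodict  # third-party: pip install xmltodict

# Stand-in XML; the element and attribute names come from the previous answer.
XML_TEXT = """<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>"""

# Parse the whole document into nested dicts and lists.
parsed = xmltodict.parse(XML_TEXT, attr_prefix="", force_list=("row",))

# Each <row> becomes a dict of its attributes, so the list maps directly to rows.
df_rows = pd.DataFrame(parsed["tags"]["row"])
```

All values arrive as strings, so numeric columns still need an explicit conversion such as `pd.to_numeric`.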

  • 2020-11-22 17:20

    You can use the xml package (from the Python standard library) to build a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or file object):

    import pandas as pd
    import xml.etree.ElementTree as ET
    import io
    
    def iter_docs(author):
        author_attr = author.attrib
        for doc in author.iter('document'):
            doc_dict = author_attr.copy()
            doc_dict.update(doc.attrib)
            doc_dict['data'] = doc.text
            yield doc_dict
    
    xml_data = io.StringIO(u'''\
    <author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
        <documents count="N">
            <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
    ]]>
            </document>
            <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
    ]]>
            </document>
        </documents>
    </author>
    ''')
    
    etree = ET.parse(xml_data)  # create an ElementTree object
    doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
    

    If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

    def iter_author(etree):
        for author in etree.iter('author'):
            for row in iter_docs(author):
                yield row
    

    and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
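
Put together, the multi-author variant might look like the sketch below. The two-author sample and its `authors` root element are made up for illustration. One deliberate difference from the code above: author attributes get a hypothetical `author_` prefix, because both `author` and `document` carry a `web` attribute and a plain `update` would let the document value overwrite the author value.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up sample with two authors under an assumed <authors> root.
XML_TEXT = """\
<authors>
  <author type="XXX" language="EN" web="foobar.com">
    <documents count="2">
      <document KEY="k1" web="www.example.com"><![CDATA[first text]]></document>
      <document KEY="k2" web="www.example.com"><![CDATA[second text]]></document>
    </documents>
  </author>
  <author type="YYY" language="ES" web="example.org">
    <documents count="1">
      <document KEY="k3" web="www.example.net"><![CDATA[third text]]></document>
    </documents>
  </author>
</authors>
"""


def iter_docs(author):
    """Yield one dict per <document>, combined with its author's attributes."""
    # Prefix author attributes so the author-level "web" survives the merge.
    author_attr = {"author_" + key: value for key, value in author.attrib.items()}
    for doc in author.iter("document"):
        row = dict(author_attr)
        row.update(doc.attrib)
        row["data"] = (doc.text or "").strip()
        yield row


def iter_author(tree):
    """Yield document rows for every <author> found anywhere in the tree."""
    for author in tree.iter("author"):
        yield from iter_docs(author)


tree = ET.parse(io.StringIO(XML_TEXT))
doc_df = pd.DataFrame(list(iter_author(tree)))
```

Each row of `doc_df` is one document, with the owning author's attributes repeated on every row.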

    Have a look at the ElementTree tutorial provided in the xml library documentation.

  • 2020-11-22 17:22

    Here is another way of converting XML to a pandas DataFrame. This example parses the XML from a string, but the same logic works when reading from a file.

    import pandas as pd
    import xml.etree.ElementTree as ET
    
    xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'
    
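    root = ET.fromstring(xml_str)  # fromstring returns the root Element
    dfcols = ['id', 'name']
    
    # Collect one dict per <data> element, then build the frame in a single call.
    # DataFrame.append was removed in pandas 2.0, and appending row by row is slow.
    rows = [{col: elem.get(col) for col in dfcols} for elem in root.iter(tag='data')]
    df = pd.DataFrame(rows, columns=dfcols)
    
    df.head()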
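
The answer says the same logic applies when reading from a file. A minimal sketch of that variant follows, assuming the XML is first written to a temporary file so the snippet can run on its own. The file name `response.xml` and the helper `data_frame_from_file` are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

XML_TEXT = """<?xml version="1.0" encoding="utf-8"?>
<response>
 <head><code>200</code></head>
 <body>
  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
 </body>
</response>"""


def data_frame_from_file(path, columns=("id", "name")):
    """Hypothetical helper: read <data> elements from an XML file into a DataFrame."""
    root = ET.parse(path).getroot()  # ET.parse accepts a path or a file object
    rows = [{col: elem.get(col) for col in columns} for elem in root.iter("data")]
    return pd.DataFrame(rows, columns=list(columns))


with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "response.xml")
    with open(xml_path, "w", encoding="utf-8") as handle:
        handle.write(XML_TEXT)
    df_from_file = data_frame_from_file(xml_path)
```

The only change from the string version is `ET.parse(path).getroot()` in place of `ET.fromstring(xml_str)`.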
    