Python XML Parse and getElementsByTagName

感动是毒 2021-01-21 06:05

I was trying to parse the following XML and fetch the specific tags that I'm interested in for my business need, and I guess I'm doing something wrong. Not sure how to parse my

3 answers
  • 2021-01-21 06:11

    Another method.

    from simplified_scrapy import SimplifiedDoc, utils, req
    # html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
    html = '''
    <couponfeed>
     <TotalMatches>1459</TotalMatches>
     <TotalPages>3</TotalPages>
     <PageNumberRequested>1</PageNumberRequested>
     <link type="TEXT">
      <categories>
       <category id="1">Apparel</category>
      </categories>
      <promotiontypes>
        <promotiontype id="11">Percentage off</promotiontype>
       </promotiontypes>
       <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
       <offerstartdate>2020-07-24</offerstartdate>
       <offerenddate>2020-07-26</offerenddate>
       <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
        <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
        <advertiserid>3184</advertiserid>
        <advertisername>cys.com</advertisername>
        <network id="1">US Network</network>
      </link>
     </couponfeed>
    '''
    doc = SimplifiedDoc(html)
    df_cols = [
        "promotiontype", "category", "offerdescription", "offerstartdate",
        "offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
    ]
    rows = [df_cols]
    
    links = doc.couponfeed.links  # Get all links
    for link in links:
        row = []
        for col in df_cols:
            row.append(link.select(col).text)  # Get col text
        rows.append(row)
    
    utils.save2csv('merchants_offers_share.csv', rows)  # Save to csv file
    

    Result:

    promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
    Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network
    

    Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

    To remove the trailing empty row from the CSV file:

    import io
    with io.open('merchants_offers_share.csv', "rb+") as f:
        f.seek(-1,2)
        l = f.read()
        if l == b"\n":
            f.seek(-2,2)
            f.truncate()
    
  • 2021-01-21 06:21

    First, the XML document wasn't parsing because you copied a raw ampersand (&) from the rendered source page. In XML, & is a reserved character that starts an entity reference, so a literal ampersand must be written as &amp;. When your browser renders XML (or HTML), it displays &amp; as &.
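
    To illustrate the point about the ampersand, here is a minimal sketch using only the standard library. The helper `escape_bare_ampersands` and its regular expression are hypothetical names introduced for this example, and the URL is shortened from the question's sample. It is only meant for text pasted from a rendered page; a feed fetched directly from the server should already be escaped.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Matches an "&" that does not already start an entity such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)")


def escape_bare_ampersands(xml_text):
    """Hypothetical helper: turn stray '&' into '&amp;' so the text parses."""
    return BARE_AMP.sub("&amp;", xml_text)


# Sample pasted from a rendered page, so the ampersands are unescaped
raw = (
    "<link><clickurl>"
    "https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3"
    "</clickurl></link>"
)

try:
    minidom.parseString(raw)
    raw_parses = True
except ExpatError:
    raw_parses = False  # a bare '&' is not well-formed XML

doc = minidom.parseString(escape_bare_ampersands(raw))
node = doc.getElementsByTagName("clickurl")[0]
# Join all text children; the parser turns '&amp;' back into '&'
clickurl = "".join(child.data for child in node.childNodes)
print(raw_parses, clickurl)
```

    The negative lookahead leaves existing entities such as &amp; untouched, so running the helper twice does not double-escape the text.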

    As for the code, the easiest way to get the data is to iterate over df_cols and call getElementsByTagName for each column, which returns a list of the elements with that tag name.

    from xml.dom import minidom
    import pandas as pd
    import urllib.request
    
    limit = 500
    url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"
    
    
    xmldoc = minidom.parse(urllib.request.urlopen(url))
    
    df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
    
    # create one dict per expected row (the feed returns at most `limit` results per page)
    rows = [{} for _ in range(limit)]
    
    
    for row_name in df_cols:
    
        # get results for each row_name
        nodes = xmldoc.getElementsByTagName(row_name)
        for i, node in enumerate(nodes):
            rows[i][row_name] = node.firstChild.nodeValue
    
    
    out_df = pd.DataFrame(rows, columns=df_cols)
    
    

    This isn't the most efficient way to do this, but I'm not sure how else to do it using minidom. If efficiency is a concern, I'd recommend using lxml instead.
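
    One possible variation, still with minidom, is to loop over each link element and call getElementsByTagName on that element rather than on the whole document. This is a sketch, not the answer's method: the second link, its "Shoes" category, and "example-shop" are made-up sample data, and `text_of` is a hypothetical helper. The point is that a row stays aligned even when a link is missing one of the tags.

```python
from xml.dom import minidom

# Trimmed sample; the second <link> is invented and omits <offerdescription>
xml_text = """
<couponfeed>
  <link type="TEXT">
    <categories><category id="1">Apparel</category></categories>
    <offerdescription>25% Off Boys Quiksilver Apparel.</offerdescription>
    <advertisername>cys.com</advertisername>
  </link>
  <link type="TEXT">
    <categories><category id="2">Shoes</category></categories>
    <advertisername>example-shop</advertisername>
  </link>
</couponfeed>
"""

df_cols = ["category", "offerdescription", "advertisername"]


def text_of(element):
    """Concatenate the text nodes directly under element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


xmldoc = minidom.parseString(xml_text)
rows = []
for link in xmldoc.getElementsByTagName("link"):
    row = {}
    for col in df_cols:
        matches = link.getElementsByTagName(col)  # searches inside this link only
        row[col] = text_of(matches[0]) if matches else None
    rows.append(row)

print(rows)
```

    The resulting list of dicts can be passed to pd.DataFrame(rows, columns=df_cols) in the same way as in the code above.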

  • 2021-01-21 06:25

    Assuming there is no issue parsing your XML from the URL (the link is not accessible on our end), your first lxml attempt can work if you search for nodes that actually exist. Specifically, there is no <item> node in the XML document.

    Instead, use link, and consider a nested list/dict comprehension to migrate the content to a data frame. With lxml you can swap findall for xpath and get the same result.

    # root is the parsed <couponfeed> element; (item.text or "") guards against item.text being None
    df = pd.DataFrame([{item.tag: item.text if (item.text or "").strip() != "" else item.find("*").text
                           for item in lnk.findall("*")}
                           for lnk in root.findall('.//link')])
                           
    print(df)
    #   categories  promotiontypes                                   offerdescription  ... advertiserid advertisername     network
    # 0    Apparel  Percentage off  25% Off Boys Quiksilver Apparel. Shop now at M...  ...         3184        cys.com  US Network
    # 1    Apparel  Percentage off  25% Off Boys' Quiksilver Apparel. Shop now at ...  ...         3184        cys.com  US Network
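
    For readers without the feed at hand, the same comprehension can be tried on an inline sample. This sketch swaps in the standard library's xml.etree.ElementTree, which shares the findall API with lxml (the xpath method is lxml-only), and builds plain dicts instead of a data frame. The helper `leaf_text` is a hypothetical name, and the XML is trimmed from the question's sample.

```python
import xml.etree.ElementTree as ET

xml_text = """
<couponfeed>
  <TotalMatches>1459</TotalMatches>
  <link type="TEXT">
    <categories>
      <category id="1">Apparel</category>
    </categories>
    <promotiontypes>
      <promotiontype id="11">Percentage off</promotiontype>
    </promotiontypes>
    <offerstartdate>2020-07-24</offerstartdate>
    <advertiserid>3184</advertiserid>
    <network id="1">US Network</network>
  </link>
</couponfeed>
"""

root = ET.fromstring(xml_text)


def leaf_text(item):
    """Return the element's own text, or its first child's text for wrapper tags."""
    if (item.text or "").strip():
        return item.text
    child = item.find("*")
    return child.text if child is not None else None


# One dict per <link>, keyed by the tag of each direct child
records = [
    {item.tag: leaf_text(item) for item in lnk.findall("*")}
    for lnk in root.findall(".//link")
]

print(records)
```

    Wrapper elements such as categories contain only whitespace text, which is why the helper falls back to the first child element. Passing records to pd.DataFrame would give one row per link.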
    