Python XML Parse and getElementsByTagName

感动是毒 2021-01-21 06:05

I was trying to parse the following XML and fetch the specific tags I'm interested in for my business need, and I guess I'm doing something wrong. Not sure how to parse my

3 answers
  •  后悔当初
    2021-01-21 06:21

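    First, the XML document wasn't parsing because you copied a raw ampersand (&) from the source page, and & is a reserved character in XML: it starts an entity reference. When your browser renders XML (or HTML), it converts &amp; into &, so the text you copied no longer contains the escaped form that the parser requires.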
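
    To make the ampersand problem concrete, here is a minimal sketch. The one-element document and the example.com URL are made up for illustration and are not taken from the real feed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape

# Hypothetical URL containing a raw ampersand (not from the real feed).
url = "http://example.com/click?id=1&type=3"

# Pasting the URL into XML unescaped produces a malformed document.
raw = f"<clickurl>{url}</clickurl>"
try:
    minidom.parseString(raw)
    raw_parses = True
except ExpatError:
    # A bare & must begin an entity reference such as &amp;, so expat rejects it.
    raw_parses = False

# escape() rewrites & as &amp; (and < > as &lt; &gt;), which makes the document valid.
fixed = f"<clickurl>{escape(url)}</clickurl>"
node = minidom.parseString(fixed).documentElement

# The parser decodes &amp; back to & when it builds the text node.
parsed_url = node.firstChild.nodeValue
```

    If the XML is fetched straight from the feed URL rather than copied from a rendered browser page, the ampersands should already be escaped and no manual fix is needed.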

    As for the code, the easiest way to get the data is to iterate over df_cols and call getElementsByTagName for each column name. It returns a list of every element with that tag, in document order, so the i-th element belongs to the i-th row as long as every row contains every tag.

    from xml.dom import minidom
    import pandas as pd
    import urllib.request  # a bare "import urllib" does not load the request submodule
    
    limit = 500
    url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"
    
    
    xmldoc = minidom.parse(urllib.request.urlopen(url))
    
    df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
    
    # create an object for each row
    rows = [{} for i in range(limit)]
    
    for row_name in df_cols:
    
        # get results for each row_name
        nodes = xmldoc.getElementsByTagName(row_name)
        for i, node in enumerate(nodes):
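            # an empty element has no text child, so guard against firstChild being None
            rows[i][row_name] = node.firstChild.nodeValue if node.firstChild else None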
    
    
    out_df = pd.DataFrame(rows, columns=df_cols)
    

    This isn't the most efficient way to do this, but I'm not sure how else to do it using minidom. If efficiency is a concern, I'd recommend using lxml instead.
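
    As a rough sketch of a row-by-row alternative, the example below uses the standard library's xml.etree.ElementTree; lxml.etree follows the same API closely, so swapping the import should cover most of the change. The sample document, the couponfeed and link element names, and the parse_rows helper are assumptions made for illustration, because the real feed layout is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names "couponfeed" and "link" are assumed.
SAMPLE = """<couponfeed>
  <link>
    <promotiontype>Coupon</promotiontype>
    <advertisername>Shop A</advertisername>
    <clickurl>http://example.com/a?x=1&amp;y=2</clickurl>
  </link>
  <link>
    <promotiontype>Sale</promotiontype>
    <clickurl>http://example.com/b</clickurl>
  </link>
</couponfeed>"""

df_cols = ["promotiontype", "advertisername", "clickurl"]


def parse_rows(xml_text, row_tag, columns):
    """Return one dict per row_tag element; a missing child tag becomes None."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(row_tag):
        # findtext looks only at direct children and returns None when absent.
        rows.append({col: record.findtext(col) for col in columns})
    return rows


rows = parse_rows(SAMPLE, "link", df_cols)
```

    Because each row is built from its own parent element, a record that lacks a tag gets None in that column instead of shifting later values into the wrong row, and the document is walked once rather than once per column. The resulting list can be passed to pd.DataFrame(rows, columns=df_cols) exactly as in the code above.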
