Python XML Parse and getElementsByTagName

感动是毒 2021-01-21 06:05

I was trying to parse the following XML and fetch the specific tags that I'm interested in for my business need, and I guess I'm doing something wrong. Not sure how to parse my

3 Answers

后悔当初 (original poster)

2021-01-21 06:21

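First, the XML document wasn't parsing because you copied a raw ampersand (&) from the source page. The ampersand is a reserved character in XML and has to be written as &amp; inside a document. When your browser renders XML (or HTML), it converts &amp; into &, so what you copied from the rendered page was no longer valid XML.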
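The ampersand problem can be reproduced offline with a small sketch. The sample document and the example.com URL below are made up for illustration and are not the real feed; the point is only that a raw & makes minidom raise a parse error, while the escaped form parses and comes back as a plain &.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape

# Hypothetical URL containing a raw ampersand (not the real feed).
link = "http://example.com/?a=1&b=2"

# 1) Raw ampersand inside element text: not well-formed XML.
raw = "<offer><clickurl>" + link + "</clickurl></offer>"
try:
    minidom.parseString(raw)
    raw_parses = True
except ExpatError:
    raw_parses = False

# 2) Escaped ampersand (&amp;): parses, and the parser hands back "&".
fixed = "<offer><clickurl>" + escape(link) + "</clickurl></offer>"
doc = minidom.parseString(fixed)
url = doc.getElementsByTagName("clickurl")[0].firstChild.nodeValue

print(raw_parses)
print(url)
```

If the XML is fetched directly from the server rather than pasted from a rendered page, the escaping should already be in place and no manual fix is needed.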

As for the code, the easiest way to get the data is to iterate over df_cols and call getElementsByTagName for each column, which returns a list of all elements with that tag name. The i-th element of each list is then written into the i-th row.

from xml.dom import minidom
import pandas as pd
import urllib.request  # "import urllib" alone does not load the request submodule

limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"


xmldoc = minidom.parse(urllib.request.urlopen(url))

df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]

# create one dict per expected row (the feed returns at most `limit` results)
rows = [{} for _ in range(limit)]

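for col_name in df_cols:
    # all elements with this tag name, in document order
    nodes = xmldoc.getElementsByTagName(col_name)
    for i, node in enumerate(nodes):
        # an empty element has no text child, so guard against None
        rows[i][col_name] = node.firstChild.nodeValue if node.firstChild else None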


out_df = pd.DataFrame(rows, columns=df_cols)
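One caveat with the column-by-column loop: it matches values to rows purely by position, so if one record lacks a tag, every later value in that column shifts up by a row. A possible variant, sketched here on an inline sample, walks each record element and looks up the columns inside it. The <couponfeed>/<link> wrapper names, the sample values, and the text_of helper are assumptions for illustration; the real feed's record element may be named differently.

```python
from xml.dom import minidom

# Made-up sample; the second record deliberately lacks <advertisername>.
SAMPLE = """<couponfeed>
  <link>
    <promotiontype>Free Shipping</promotiontype>
    <advertisername>Acme</advertisername>
  </link>
  <link>
    <promotiontype>Percentage Off</promotiontype>
  </link>
</couponfeed>"""

cols = ["promotiontype", "advertisername"]


def text_of(parent, tag):
    """Text of the first <tag> below parent, or None if absent or empty."""
    found = parent.getElementsByTagName(tag)
    if not found or found[0].firstChild is None:
        return None
    return found[0].firstChild.nodeValue


doc = minidom.parseString(SAMPLE)

# One dict per record element, so a missing tag only affects its own row.
rows = [{c: text_of(rec, c) for c in cols}
        for rec in doc.getElementsByTagName("link")]

for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows, columns=cols) in the same way as above.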

This isn't the most efficient way to do this, since it scans the whole document once per column, but I'm not sure how else to do it using minidom. If efficiency is a concern, I'd recommend using lxml instead.
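For comparison, here is a rough sketch of the same extraction in a single pass. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs; lxml.etree offers a largely compatible interface, but that substitution is an assumption here and has not been tested against the feed. The <couponfeed>/<link> structure and the sample values are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the real feed.
SAMPLE = """<couponfeed>
  <link>
    <promotiontype>Free Shipping</promotiontype>
    <advertisername>Acme</advertisername>
  </link>
  <link>
    <promotiontype>Percentage Off</promotiontype>
  </link>
</couponfeed>"""

cols = ["promotiontype", "advertisername"]

root = ET.fromstring(SAMPLE)

# findtext returns None when the child element is missing,
# so each record yields a complete row in one pass over the tree.
rows = [{c: rec.findtext(c) for c in cols} for rec in root.iter("link")]

for row in rows:
    print(row)
```

Note that findtext with a bare tag name only looks at direct children of each record; nested tags would need a path such as ".//category".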
