Python XML parsing from website

余生分开走 2021-02-15 10:04

I am trying to parse XML from a website and I am stuck. I will provide the XML below. I have two questions. What is the best way to read XML from a website?

2 answers
  • 2021-02-15 10:45

    Take a look at your code:

    document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
    web = urllib.urlopen(document)
    get_web = web.read()
    xmldoc = minidom.parseString(document)
    

    I'm not sure document holds what you intended, unless you really want http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=dailyr, because that is what you get: the parentheses here only group the expression, and string literals written next to each other are concatenated automatically.
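
    To see the concatenation in isolation, here is a minimal sketch. The name pair is only illustrative and is not part of the original code.

```python
# Adjacent string literals are joined into one string, and parentheses
# without a comma only group the expression (they do not build a tuple).
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
print(document)        # the URL with a stray "r" appended
print(type(document))  # <class 'str'>

# With a comma between the literals you get a tuple of two strings instead.
pair = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily', 'r')
print(len(pair))       # 2
```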

    After that you do some work to create get_web, but you don't use it on the next line. Instead you try to parse document, which is the URL rather than the XML it points to.
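
    The difference can be checked without touching the network. In this rough sketch, xml_text is a made-up stand-in for whatever web.read() would return, not the real feed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'

# Hypothetical sample standing in for the downloaded XML text.
xml_text = '<root><value>1.23</value></root>'

# Parsing the URL string fails, because a URL is not an XML document.
try:
    minidom.parseString(url)
    url_parsed = True
except ExpatError:
    url_parsed = False

# Parsing the downloaded text is what was intended.
xmldoc = minidom.parseString(xml_text)
value = xmldoc.getElementsByTagName('value')[0].firstChild.nodeValue
print(url_parsed, value)
```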

    Beyond that, I would strongly suggest using ElementTree, preferably lxml's ElementTree (http://lxml.de/). lxml's etree parser also accepts a file-like object, which can be the object returned by urlopen. After straightening out the rest of your code, you could do this:

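    from lxml import etree
    from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

    url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
    root = etree.parse(urlopen(url)).getroot()

    # XPath needs a prefix-to-URI map. This reuses the prefixes declared on the
    # root element (assumed to include ff and base) and drops the default
    # namespace, which has no prefix and is rejected by xpath().
    ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix}

    for obs in root.xpath('/ff:DataSet/ff:Series/ff:Obs', namespaces=ns):
        # findtext returns the text of the first match, or None if there is none
        price = obs.findtext('base:OBS_VALUE', namespaces=ns)
        print(price)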
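
    If lxml is not available, the same idea can be tried with the standard library's xml.etree.ElementTree. This is only a sketch: the sample document and both namespace URIs are invented for illustration, and the real feed declares its own.

```python
import xml.etree.ElementTree as ET

# Made-up document that mimics the ff:/base: structure discussed above.
sample = """\
<ff:DataSet xmlns:ff="http://example.com/ff" xmlns:base="http://example.com/base">
  <ff:Series>
    <ff:Obs><base:OBS_VALUE>0.10</base:OBS_VALUE></ff:Obs>
    <ff:Obs><base:OBS_VALUE>0.12</base:OBS_VALUE></ff:Obs>
  </ff:Series>
</ff:DataSet>
"""

# Prefix-to-URI map; the prefixes only need to match the ones used in the paths.
ns = {'ff': 'http://example.com/ff', 'base': 'http://example.com/base'}

root = ET.fromstring(sample)  # root is the ff:DataSet element
prices = [obs.findtext('base:OBS_VALUE', namespaces=ns)
          for obs in root.findall('ff:Series/ff:Obs', ns)]
print(prices)
```

    The standard library version supports only a subset of XPath, so paths are written relative to the root element rather than starting with a slash.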
    
  • 2021-02-15 10:52

    If you wanted to stick with xml.dom.minidom, try this...

    from xml.dom import minidom
    from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen

    url_str = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
    # download the XML text, then parse that text (not the URL)
    xml_str = urlopen(url_str).read()
    xmldoc = minidom.parseString(xml_str)

    obs_values = xmldoc.getElementsByTagName('base:OBS_VALUE')
    # prints the first base:OBS_VALUE it finds
    print(obs_values[0].firstChild.nodeValue)

    # prints the second base:OBS_VALUE it finds
    print(obs_values[1].firstChild.nodeValue)

    # prints all base:OBS_VALUE in the XML document
    for obs_val in obs_values:
        print(obs_val.firstChild.nodeValue)
    

    However, if you want to use lxml, use underrun's solution (the lxml answer above). Also, your original code had some errors. You were actually parsing the document variable, which holds the web address. You need to parse the XML returned from the website, which in your example is the get_web variable.
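
    Note that getElementsByTagName('base:OBS_VALUE') matches on the prefixed name as written in the document. As a hedged sketch, the snippet below compares it with getElementsByTagNameNS, which matches on the namespace URI instead. The sample XML and its URIs are made up for illustration.

```python
from xml.dom import minidom

# Invented sample; the real feed declares its own namespace URIs.
sample = ('<ff:DataSet xmlns:ff="http://example.com/ff" '
          'xmlns:base="http://example.com/base">'
          '<ff:Obs><base:OBS_VALUE>0.10</base:OBS_VALUE></ff:Obs>'
          '<ff:Obs><base:OBS_VALUE>0.12</base:OBS_VALUE></ff:Obs>'
          '</ff:DataSet>')

xmldoc = minidom.parseString(sample)

# Lookup by qualified name: depends on the prefix used in the document.
by_name = [node.firstChild.nodeValue
           for node in xmldoc.getElementsByTagName('base:OBS_VALUE')]

# Lookup by namespace URI and local name: independent of the prefix.
by_ns = [node.firstChild.nodeValue
         for node in xmldoc.getElementsByTagNameNS('http://example.com/base', 'OBS_VALUE')]

print(by_name, by_ns)
```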
