Parsing text from XML node in Python

傲寒 2020-12-02 02:36

I'm trying to extract URLs from a sitemap like this: https://www.bestbuy.com/sitemap_c_0.xml.gz

I've unzipped and saved the .xml.gz file as an .xml file. The struc

3 Answers
  • 2020-12-02 03:29

    We can iterate through the url elements, collect each loc value in a list, and write the list to a file:

    from xml.etree import ElementTree as ET
    
    tree = ET.parse('test.xml')
    root = tree.getroot()
    
    # Default namespace declared on the root <urlset> element
    name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
    
    # Each <url> is a direct child of the root; its <loc> child holds the URL
    urls = []
    for block in root.findall('{}url'.format(name_space)):
        for url in block.findall('{}loc'.format(name_space)):
            urls.append('{}\n'.format(url.text))
    
    # Write one URL per line
    with open('sample_urls.txt', 'w') as f:
        f.writelines(urls)
    
    • Note that each tag name must be prefixed with the namespace from the opening urlset element; without it, the lookups match nothing.
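
    To illustrate why the prefix matters, here is a minimal sketch on an inline sample. The two example.com URLs and the SAMPLE string are made up for illustration; only the namespace URI comes from the sitemap itself.

```python
from xml.etree import ElementTree as ET

# Hypothetical two-entry sitemap, inlined so the sketch runs without a file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

root = ET.fromstring(SAMPLE)
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# ElementTree stores the namespace as part of every tag name.
print(root.tag)

# A bare tag name therefore matches nothing...
bare = root.findall('url')
# ...while the namespace-qualified name matches both entries.
qualified = root.findall(name_space + 'url')

# Collect the text of every <loc> element, at any depth.
urls = [loc.text for loc in root.iter(name_space + 'loc')]
print(len(bare), len(qualified), urls)
```

    Printing root.tag on your own file is a quick way to check which namespace, if any, the document declares.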
  • 2020-12-02 03:36

    You were close in your attempt, but as mzjn said in a comment, you didn't account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

    Here's an example of how to account for the namespace:

    import xml.etree.ElementTree as ET
    tree = ET.parse('my_local_filepath')
    
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    
    for elem in tree.findall(".//sm:loc", ns):
        print(elem.text)
    

    output:

    https://www.bestbuy.com/
    https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
    https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647
    

    Note that I used the namespace prefix sm, but you could use any NCName.
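
    As a rough check that the prefix really is arbitrary, the sketch below runs the same search with two different prefixes on a made-up inline sample (the example.com URLs are placeholders). It also tries the {*} wildcard, which as far as I know requires Python 3.8 or later and matches the tag in any namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the namespace URI is taken from the real sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

root = ET.fromstring(SAMPLE)
uri = "http://www.sitemaps.org/schemas/sitemap/0.9"

# The prefix is only a local alias for the URI, so any valid name works.
with_sm = [e.text for e in root.findall(".//sm:loc", {"sm": uri})]
with_other = [e.text for e in root.findall(".//anything:loc", {"anything": uri})]

# Python 3.8+: {*} matches the tag regardless of its namespace.
wildcard = [e.text for e in root.findall(".//{*}loc")]

print(with_sm == with_other == wildcard)
```

    The wildcard is convenient for quick scripts, but an explicit mapping is safer when a document mixes several namespaces.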

    See the "Parsing XML with Namespaces" section of the xml.etree.ElementTree documentation for more information.
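
    If you would rather skip the manual unzip step, something like the following should work, since ET.parse accepts an open file object. This is a sketch under assumptions: read_sitemap_urls is a hypothetical helper name, and the tiny sitemap written to a temporary file stands in for the real download.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_urls(path):
    """Return the text of every <loc> element in a gzip-compressed sitemap."""
    # Binary mode lets the XML parser handle the declared encoding itself.
    with gzip.open(path, "rb") as f:
        tree = ET.parse(f)
    return [elem.text for elem in tree.findall(".//sm:loc", NS)]


# Hypothetical sample data, written to a temp file so the sketch is runnable.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sitemap_sample.xml.gz")
    with gzip.open(path, "wb") as f:
        f.write(sample)
    urls = read_sitemap_urls(path)

print(urls)
```

    For the real file you would pass the path of the downloaded .xml.gz instead of the temporary one.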

  • 2020-12-02 03:38

    I know this is a bit of a zombie reply, but I just posted a tool on GitHub that does exactly what you're looking for, and it's written in Python. Feel free to take what you need from the source code (or use it as-is). I'm posting it here so that other people who come across this thread can find it.

    Here it is: https://github.com/tcaldron/xmlscrape
