Scraping XML data with BS4 “lxml”

Question

I am trying to solve a problem very similar to this one:

Scraping XML element attributes with beautifulsoup

I have the following code:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])

The code runs without error but does not print any values to the terminal.

I want to extract the "cell_value1" data noted above for the whole page so I have something like this:

2468.58
3061.58
376.64
and so on...

The format of my XML file is the same as the sample in the solution from the question noted above, and I identified the tag that carries the attribute I want to scrape. Why are the values not printing to the terminal?

Answer 1:

The problem is that you're parsing this file in HTML mode, which means the tags end up named 'cell' instead of 'Cell'. As a result, find_all("Cell") matches nothing and the loop body never runs. You could just search with 'cell', but the right answer is to parse in XML mode.

To do this, just use 'xml' as your parser instead of 'lxml'. (It's a little non-obvious that 'lxml' means "lxml in HTML mode" and 'xml' means "lxml in XML mode", but it is documented.)

This is explained in the "Other parser problems" section of the Beautiful Soup documentation:

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you’ll need to parse the document as XML.
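
To make the difference concrete, here is a minimal sketch that parses the same string with both parsers. The sample markup is made up for illustration and is not the real WASDE file; it assumes only that bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, invented for this illustration (not the real file).
sample = '<Report><Cell cell_value1="2468.58"></Cell><Cell></Cell></Report>'

html_soup = BeautifulSoup(sample, "lxml")  # lxml in HTML mode: names are lowercased
xml_soup = BeautifulSoup(sample, "xml")    # lxml in XML mode: case is preserved

html_upper = html_soup.find_all("Cell")  # no match, the tags are now named 'cell'
html_lower = html_soup.find_all("cell")  # matches both tags
xml_upper = xml_soup.find_all("Cell")    # matches both tags

print(len(html_upper), len(html_lower), len(xml_upper))
```

In HTML mode only the lowercase search finds anything, which is why the original loop printed nothing.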

Your code will still fail because of a second problem: some of the Cell nodes are empty and do not have a cell_value1 attribute, but you're trying to print it unconditionally, which raises a KeyError.

So, what you want is something like this:

soup = BeautifulSoup(data, "xml")
for ce in soup.find_all("Cell"):
    try:
        print(ce["cell_value1"])
    except KeyError:
        pass
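
As an alternative to try/except, you could let Beautiful Soup skip the empty nodes for you. The sketch below again uses a made-up sample string rather than the real file; it shows an attribute filter (a value of True matches any tag that has the attribute) and Tag.get, which returns None when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, invented for this illustration (not the real file).
sample = """<Report>
  <Cell cell_value1="2468.58"></Cell>
  <Cell></Cell>
  <Cell cell_value1="3061.58"></Cell>
</Report>"""

soup = BeautifulSoup(sample, "xml")

# Only Cell tags that actually carry a cell_value1 attribute are returned.
cells_with_value = soup.find_all("Cell", attrs={"cell_value1": True})
values = [ce["cell_value1"] for ce in cells_with_value]

# Tag.get never raises; missing attributes come back as None.
values_via_get = [ce.get("cell_value1") for ce in soup.find_all("Cell")]

for value in values:
    print(value)
```

Which form to use is mostly a matter of taste; the filter keeps the loop body free of error handling, while try/except stays closest to the original code.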

Source: https://stackoverflow.com/questions/49639450/scraping-xml-data-with-bs4-lxml

Tags

python

python-3.x

beautifulsoup

lxml

elementtree