Question
I'm trying to solve a problem very similar to this one:
Scraping XML element attributes with beautifulsoup
I have the following code:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.usda.gov/oce/commodity/wasde/latest.xml')
data = r.text
soup = BeautifulSoup(data, "lxml")
for ce in soup.find_all("Cell"):
    print(ce["cell_value1"])
The code runs without error but does not print any values to the terminal.
I want to extract the "cell_value1" data noted above for the whole page so I have something like this:
2468.58
3061.58
376.64
and so on...
The format of my XML file is the same as the sample in the solution from the question noted above. I identified the attribute specific to the data I want to scrape. Why are the values not printing to the terminal?
Answer 1:
The problem is that you're parsing this file in HTML mode, which means the tags end up named 'cell' instead of 'Cell'. So, you could just search with 'cell', but the right answer is to parse in XML mode.
To do this, just use 'xml' as your parser instead of 'lxml'. (It's a little non-obvious that 'lxml' means "lxml in HTML mode" and 'xml' means "lxml in XML mode", but it is documented.)
This is explained in the "Other parser problems" section of the Beautiful Soup documentation:
Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML.
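The difference between the two modes can be checked with a small sketch. The sample markup below is a made-up stand-in for the real file, not its actual contents, and the variable names are only illustrative. It assumes lxml is installed, as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real XML file.
sample = '<Report><Cell cell_value1="2468.58"></Cell><Cell></Cell></Report>'

html_soup = BeautifulSoup(sample, "lxml")  # lxml in HTML mode: names are lowercased
xml_soup = BeautifulSoup(sample, "xml")    # lxml in XML mode: case is preserved

html_hits = len(html_soup.find_all("Cell"))        # no tag is named "Cell" here
html_lower_hits = len(html_soup.find_all("cell"))  # the lowercased name matches
xml_hits = len(xml_soup.find_all("Cell"))          # original case matches

print(html_hits, html_lower_hits, xml_hits)
```

In HTML mode the search for "Cell" finds nothing, which is why the original loop body never ran and no error was raised.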
Once you switch parsers, your code will still fail with a KeyError because of a second problem: some of the Cell nodes are empty and do not have a cell_value1 attribute, but you're trying to print it unconditionally.
So, what you want is something like this:
soup = BeautifulSoup(data, "xml")
for ce in soup.find_all("Cell"):
    try:
        print(ce["cell_value1"])
    except KeyError:
        pass
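As an alternative to catching KeyError, you could ask find_all to return only the Cell tags that actually carry the attribute. This is a sketch under assumptions: the helper name cell_values and the sample markup are hypothetical, and the real file may be structured differently.

```python
from bs4 import BeautifulSoup

def cell_values(xml_text, attr="cell_value1"):
    """Return the given attribute from every Cell tag that has it."""
    soup = BeautifulSoup(xml_text, "xml")
    # attrs={attr: True} matches only tags where the attribute is present,
    # so empty Cell nodes are skipped without a try/except.
    return [ce[attr] for ce in soup.find_all("Cell", attrs={attr: True})]

# Hypothetical sample markup; the middle Cell has no attribute.
sample = """<Report>
<Cell cell_value1="2468.58"></Cell>
<Cell></Cell>
<Cell cell_value1="3061.58"></Cell>
</Report>"""

values = cell_values(sample)
print(values)
```

With the real data you would pass r.text instead of sample. The values come back as strings, so convert them with float() if you need numbers.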
Source: https://stackoverflow.com/questions/49639450/scraping-xml-data-with-bs4-lxml