Question
I am trying to read in the following url using urllib2: http://frcwest.com/ and then search the data for the meta redirect.
It reads the following data in:
<!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>
Reading it into BeautifulSoup works fine. However, none of the search functionality works in this specific scenario, and I don't understand why. BeautifulSoup has worked great for me in all other scenarios. However, simply running:
soup.findAll('meta')
produces no results.
My eventual goal is to run:
soup.find("meta",attrs={"http-equiv":"refresh"})
But if:
soup.findAll('meta')
isn't even working, then I'm stuck. Any insight into this mystery would be appreciated, thanks!
Answer 1:
It's the comment and doctype that throw the parser here, and subsequently BeautifulSoup.
Even the HTML tag seems 'gone':
>>> soup.find('html') is None
True
Yet it is still there in the .contents iterable. You can find things again with:
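# Walk the top-level nodes and re-root the search at the <html> element.
for elem in soup:
    # Comments and strings have no usable tag name, so getattr() returns None for them.
    if getattr(elem, 'name', None) == u'html':
        soup = elem
        break

soup.find_all('meta')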
Demo:
>>> for elem in soup:
...     if getattr(elem, 'name', None) == u'html':
...         soup = elem
...         break
...
>>> soup.find_all('meta')
[<meta content="0;url= Home.html" http-equiv="refresh"/>]
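Once the meta tag is found, its content attribute ("0;url= Home.html" here) still has to be split into a delay and a target before the redirect can be followed. The following is a minimal sketch of that step, not part of the original answer. The helper name parse_meta_refresh is hypothetical, and it assumes Python 3 (urllib.parse); on Python 2, which the question's urllib2 implies, urljoin lives in the urlparse module instead.

```python
from urllib.parse import urljoin


def parse_meta_refresh(content, base_url):
    """Split a meta refresh 'content' value into (delay_seconds, absolute_url).

    Hypothetical helper for illustration; it only handles the common
    'delay;url=target' shape and returns None for the URL otherwise.
    """
    delay_part, _, rest = content.partition(";")
    delay = float(delay_part.strip() or 0)

    target = None
    rest = rest.strip()
    if rest[:4].lower() == "url=":
        # Drop the 'url=' prefix plus any stray whitespace or quotes.
        target = rest[4:].strip().strip("'\"")

    # Resolve a relative target such as 'Home.html' against the page URL.
    url = urljoin(base_url, target) if target else None
    return delay, url


# The content value below is the one from the page in the question.
delay, url = parse_meta_refresh("0;url= Home.html", "http://frcwest.com/")
print(delay, url)
```

The content string would come from the tag itself, for example meta["content"] on the result of soup.find("meta", attrs={"http-equiv": "refresh"}).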
Source: https://stackoverflow.com/questions/16134384/beautifulsoup-functionality-not-working-properly-in-specific-scenario