BeautifulSoup functionality not working properly in a specific scenario

Question

I am trying to read the following URL using urllib2: http://frcwest.com/ and then search the data for the meta redirect.

It reads the following data in:

   <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>

Reading it into BeautifulSoup works fine. However, for some reason none of its functionality works in this specific scenario, and I don't understand why. BeautifulSoup has worked great for me in all other scenarios. Yet even a simple call such as:

    soup.findAll('meta')

produces no results.

My eventual goal is to run:

    soup.find("meta",attrs={"http-equiv":"refresh"})

But if:

    soup.findAll('meta')

isn't even working then I'm stuck. Any insight into this mystery would be appreciated, thanks!

Answer 1:

It's the comment and doctype that throw off the parser here and, subsequently, BeautifulSoup.

Even the HTML tag seems 'gone':

    >>> soup.find('html') is None
    True

Yet it is still there in the .contents iterable. You can make things findable again with:

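    # Walk the top-level nodes and re-root the search at the <html> element.
    for elem in soup:
        # Comments and doctypes have no usable tag name, hence getattr().
        if getattr(elem, 'name', None) == u'html':
            soup = elem
            break

    soup.find_all('meta')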

Demo:

    >>> for elem in soup:
    ...     if getattr(elem, 'name', None) == u'html':
    ...         soup = elem
    ...         break
    ...
    >>> soup.find_all('meta')
    [<meta content="0;url= Home.html" http-equiv="refresh"/>]
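Once the meta tag has been found, its content attribute still has to be split into the delay and the redirect target. The following is only a minimal sketch of that step, assuming Python 3 and a content value of the usual "delay;url=target" shape; parse_meta_refresh is a hypothetical helper written for illustration and is not part of BeautifulSoup.

```python
import re
from urllib.parse import urljoin


def parse_meta_refresh(content, base_url):
    """Split a meta refresh content value into (delay_seconds, absolute_url).

    Hypothetical helper: returns None when the value does not start with a
    numeric delay, and a url of None when no redirect target is given.
    """
    # A numeric delay comes first; an optional target follows ';' or ','
    # and may or may not carry a "url=" prefix.
    match = re.match(
        r"\s*(\d+(?:\.\d+)?)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$",
        content,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return None
    delay = float(match.group(1))
    # Drop surrounding whitespace and optional quotes around the target.
    target = (match.group(2) or "").strip().strip("'\"")
    # Resolve a relative target such as "Home.html" against the page URL.
    url = urljoin(base_url, target) if target else None
    return delay, url


delay, url = parse_meta_refresh("0;url= Home.html", "http://frcwest.com/")
print(delay, url)
```

With the tag from the demo above, the first argument would be the tag's content attribute (for example tag["content"]) and the second the URL that was originally fetched.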

Source: https://stackoverflow.com/questions/16134384/beautifulsoup-functionality-not-working-properly-in-specific-scenario

Tags

python

beautifulsoup

urllib2

html5lib