malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

再見小時候 2021-02-14 23:31

I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some iss

5 answers

野趣味 (original poster)

2021-02-15 00:31

The issues you are encountering are pretty common, and they come down to malformed HTML. In my case, the page had an element with an attribute value wrapped in doubled quotes. I actually ran into this today, and in doing so came across your post. I was FINALLY able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

First off, you'll need to install bs4 and html5lib:

sudo easy_install bs4
sudo apt-get install python-html5lib

Then, run this example code:

from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib

url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)

# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)

# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)

for content in soup.findAll('div'):
    print content
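
For anyone on a newer setup, here is a minimal sketch of the same idea, assuming Python 3 and a recent bs4 release. It relies on the fact that BeautifulSoup 4 can be told which parser to use by name, so html5lib does the repair work inside the BeautifulSoup call itself. The `SAMPLE_HTML` string and the `extract_divs` helper are made up for illustration, and the fallback to the standard-library parser is my own assumption about what you'd want when html5lib isn't installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: the class attribute value is wrapped in doubled quotes.
SAMPLE_HTML = '<div class=""main"">hello</div><div>world</div>'


def extract_divs(html):
    """Return all <div> tags found in possibly malformed HTML."""
    try:
        # Prefer html5lib, which recovers from broken markup the way browsers do.
        soup = BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed: fall back to the standard-library parser.
        soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div")


if __name__ == "__main__":
    for div in extract_divs(SAMPLE_HTML):
        print(div.get_text())
```

To scrape a live page, you would read the response body (for example with `urllib.request.urlopen(url).read()`) and pass it to `extract_divs` in place of the sample string.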

If you have any questions about this code or need a little more specific guidance, just let me know. :)
