I just installed Python, MPlayer, BeautifulSoup, and Sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some issues.
Assuming you are using BeautifulSoup 4, I found something about this in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
I tried this and it works well, just as @Joshua suggested:
soup = BeautifulSoup(r.text, 'html5lib')
Command Line:
$ pip install beautifulsoup4
$ pip install html5lib
Python 3:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://www.example.com'
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html5lib')
links = soup.find_all('a')
for link in links:
    print(link.string, link['href'])
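To see why the parser choice matters, here is a small sketch (assuming both bs4 and html5lib are installed) that feeds the same malformed fragment to Python's built-in parser and to html5lib:

```python
from bs4 import BeautifulSoup

# A deliberately malformed snippet: the <a> tag is never closed.
broken = '<div><a href="/x">link<p>paragraph</div>'

# html.parser leaves the fragment roughly as-is, while html5lib
# repairs it the way a browser would, adding the missing
# <html>/<head>/<body> scaffolding around it.
for parser in ('html.parser', 'html5lib'):
    soup = BeautifulSoup(broken, parser)
    print(parser, '->', soup.find('a')['href'], 'has <body>:', soup.body is not None)
```

Both parsers recover the link, but only html5lib produces a full, normalized document tree, which is exactly what helps with broken pages.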
Look at column 3 of line 100 of the "data" mentioned in the traceback at File "/usr/bin/Sipie/Sipie/Factory.py", line 298.
Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (SGMLParser was removed from the Python 3.0 standard library). As a result, BeautifulSoup can no longer process many malformed HTML documents correctly, which I believe is what you are encountering here.
A likely solution to your problem is to uninstall BeautifulSoup and install an older version (which will still work with Python 2.6 on Ubuntu 10.04 LTS):
sudo apt-get remove python-beautifulsoup
sudo easy_install -U "BeautifulSoup==3.0.7a"
Just be aware that this temporary solution will no longer work with Python 3.0 (which may become the default in future versions of Ubuntu).
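If you lose track of which generation is installed after juggling packages like this, a quick check (assuming the standard module names: a top-level BeautifulSoup module for version 3, the bs4 package for version 4) is:

```python
# BS4 installs as the bs4 package; BS3 installs as a top-level
# BeautifulSoup module, so whichever import succeeds tells you
# which generation you have.
try:
    import bs4
    print('BeautifulSoup 4:', bs4.__version__)
except ImportError:
    import BeautifulSoup
    print('BeautifulSoup 3:', BeautifulSoup.__version__)
```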
The issues you are encountering are pretty common, and they deal specifically with malformed HTML. In my case, there was an HTML element with a double-quoted attribute value. I actually ran into this issue today, and in doing so came across your post. I was FINALLY able to resolve it by running the HTML through html5lib before handing it off to BeautifulSoup 4.
First off, you'll need to:
sudo easy_install bs4
sudo apt-get install python-html5lib
Then, run this example code:
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib
url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)
# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)
# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)
for content in soup.findAll('div'):
    print content
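For what it's worth, with BeautifulSoup 4 the manual tree-builder step above can usually be skipped: bs4 will drive html5lib itself if you name it as the parser. A minimal sketch (the sample markup with stray extra quotes is made up to mimic the malformed attribute described above):

```python
from bs4 import BeautifulSoup

# Made-up sample: an attribute value with stray extra quotes,
# similar to the malformed HTML described above.
html = '<div class=""entry"">some text</div>'

# Passing 'html5lib' makes bs4 build the tree via html5lib directly;
# no explicit treebuilder or intermediate string is needed.
soup = BeautifulSoup(html, 'html5lib')
for div in soup.find_all('div'):
    print(div.get_text())
```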
If you have any questions about this code or need a little more specific guidance, just let me know. :)