I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching some parsers.
My understanding (having used BeautifulSoup for a handful of things) is that it is a wrapper for parsers like lxml or html5lib. Using whichever parser is specified (I believe the default is html.parser, the parser built into Python's standard library), BeautifulSoup creates a tree of tag elements that makes it quite easy to navigate and search the HTML for useful data contained within tags. If you really just need the text from the webpages and not more specific data from specific HTML tags, you might only need a code snippet similar to this:
from bs4 import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"))
soup.get_text()
get_text() isn't that great with complex webpages (it picks up the contents of <script> and <style> tags along with the visible text), but once you get the hang of how to use BeautifulSoup, it shouldn't be hard to extract only the text you want.
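One common workaround for the script/css noise is to remove those tags from the tree before calling get_text(). A minimal sketch (the HTML string here is just a made-up example):

```python
from bs4 import BeautifulSoup

# Hypothetical page with CSS and JavaScript mixed in with the visible text
html = """
<html><head><style>body { color: red; }</style></head>
<body><p>Useful text</p><script>var x = 1;</script></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() alone would also return the CSS and JavaScript above;
# decomposing the <script> and <style> tags first removes them from the tree
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)  # only the visible text remains
```

Calling the soup object itself, as in soup(["script", "style"]), is shorthand for find_all(), so this loop catches every script and style tag in the document.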
For your purposes it seems like you don't need to worry about installing one of those other parsers (html5lib or lxml) up front. BeautifulSoup can deal with some sloppiness on its own, and if it can't, the parse tree will come out visibly wrong or it will complain about malformed HTML; that would be your cue to install html5lib or lxml.