When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong th
I would use html5lib for this.
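
Here is a rough sketch of how that could look, assuming html5lib 1.x is installed (`pip install html5lib`). The helper name `parse_page` and the sample bytes are made up for illustration. When given raw bytes, html5lib sniffs the encoding itself (byte order mark, an encoding hint you pass in from the HTTP `Content-Type` header, then `<meta charset>`), and the parser reports what it settled on through `documentEncoding`.

```python
def parse_page(raw_bytes, http_charset=None):
    """Parse raw HTML bytes and return (tree, detected_encoding_name).

    http_charset is an optional hint, e.g. the charset taken from the
    HTTP Content-Type header (hypothetical parameter name).
    """
    # Imported here so the module still loads if html5lib is missing.
    import html5lib

    parser = html5lib.HTMLParser()  # default tree builder: xml.etree.ElementTree
    # Pass bytes, not str, so html5lib does the encoding detection itself.
    tree = parser.parse(raw_bytes, transport_encoding=http_charset)
    return tree, parser.documentEncoding


if __name__ == "__main__":
    # Illustrative page: UTF-8 bytes that declare their own encoding.
    raw = '<meta charset="utf-8"><p>café</p>'.encode("utf-8")
    tree, encoding = parse_page(raw)
    print(encoding)
    print("".join(tree.itertext()))
```

The key point is to hand the parser the undecoded bytes. If you decode them yourself first with a guessed encoding, the parser never sees the declaration in a position to act on it, and any wrong guess is already baked into the text.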