I'm trying to parse some HTML in Python. Some methods used to work, but nowadays there's nothing I can use without workarounds.
If you are scraping content, an excellent way to get around irritating details is the sitescraper package. It uses machine learning, trained on example pages you supply, to determine which content to retrieve for you.
From the homepage:
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python",
...         ["Learning Python, 3rd Edition",
...          "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)",
...          "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming",
"Linux Pocket Guide",
"Linux in a Nutshell (In a Nutshell (O'Reilly))",
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)',
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
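
To give a feel for what "training by example" can mean, here is a rough sketch using only the standard library's `html.parser`. It is not how sitescraper is implemented (I have not checked its internals), and the names `PathCollector`, `text_nodes`, `learn_paths` and `apply_paths` are made up for this illustration. The idea, under those assumptions: record the tag path of every text node on an example page, remember the paths whose text matches the strings you asked for, then pull the text found at the same paths on another page with the same layout.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.found = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return all (path, text) pairs found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.found


def learn_paths(example_html, wanted):
    """Return the tag paths whose text equals one of the wanted strings."""
    wanted = set(wanted)
    return {path for path, text in text_nodes(example_html) if text in wanted}


def apply_paths(html, paths):
    """Return the text of every node in html that sits at a learned path."""
    return [text for path, text in text_nodes(html) if path in paths]


# Hypothetical pages sharing one layout (stand-ins for two search results).
train_page = ("<html><body><h1>Shop: python</h1><ul><li>Learning Python</li>"
              "<li>Python in a Nutshell</li></ul><p>footer</p></body></html>")
new_page = ("<html><body><h1>Shop: linux</h1><ul><li>Linux Pocket Guide</li>"
            "<li>Linux Bible</li></ul><p>footer</p></body></html>")

paths = learn_paths(train_page, ["Shop: python", "Learning Python"])
print(apply_paths(new_page, paths))
```

Matching on bare tag paths is deliberately naive: real pages usually need attributes such as `class` or `id` to tell sibling elements apart, and more than one example to separate stable structure from noise, which is presumably why the session above suggests adding several example cases.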