Python html parsing that actually works

前端未结

关注

 5  1195

轻奢々 2021-01-31 21:09

I\'m trying to parse some html in Python. There were some methods that actually worked before... but nowadays there\'s nothing I can actually use without workarounds.

5条回答

爱一瞬间的悲伤 (楼主)

2021-01-31 21:31

If you are scraping content, an excellent way to get around irritating details is the sitescraper package. It uses machine learning to determine which content to retrieve for you.

From the homepage:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", 
             ["Learning Python, 3rd Edition", 
             "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", 
             "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I   generally use 3)
>>> # ss.add(url2, data2) 
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-  keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell    Programming", 
"Linux Pocket Guide", 
"Linux in a Nutshell (In a Nutshell (O'Reilly))", 
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]

0 讨论(0)

查看其它5个回答