Python HTML parsing that actually works

轻奢々 2021-01-31 21:09

I'm trying to parse some HTML in Python. Some methods used to work, but nowadays there's nothing I can use without workarounds.

5 Answers
  •  爱一瞬间的悲伤
    2021-01-31 21:31

    If you are scraping content, an excellent way to get around irritating details is the sitescraper package. It uses machine learning to determine which content to retrieve for you.

    From the homepage:

    >>> from sitescraper import sitescraper
    >>> ss = sitescraper()
    >>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
    >>> data = ["Amazon.com: python",
    ...         ["Learning Python, 3rd Edition",
    ...          "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)",
    ...          "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
    >>> ss.add(url, data)
    >>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
    >>> # ss.add(url2, data2)
    >>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
    ["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming",
    "Linux Pocket Guide", 
    "Linux in a Nutshell (In a Nutshell (O'Reilly))", 
    'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 
    'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
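
To make the train-by-example workflow above more concrete, here is a toy sketch using only the standard library's html.parser. It is not how sitescraper works internally; it only illustrates the general idea of showing a scraper the text you want on one page and reusing what it learned on a similar page. The names TextPathCollector, learn_paths, and scrape, as well as the sample pages TRAIN and NEW, are hypothetical and made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPathCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


def text_paths(html):
    """Return every (tag_path, text) pair found in the given HTML string."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.found


def learn_paths(example_html, wanted_texts):
    """Remember the tag paths at which the example texts appear."""
    wanted = set(wanted_texts)
    return {path for path, text in text_paths(example_html) if text in wanted}


def scrape(html, paths):
    """Extract the text found at the learned tag paths on another page."""
    return [text for path, text in text_paths(html) if path in paths]


# Hypothetical sample pages with the same layout but different content.
TRAIN = """
<html><body>
  <h1>Results for python</h1>
  <div class="nav">Home</div>
  <ul>
    <li><a href="/1">Learning Python</a></li>
    <li><a href="/2">Python in a Nutshell</a></li>
  </ul>
</body></html>
"""

NEW = """
<html><body>
  <h1>Results for linux</h1>
  <div class="nav">Home</div>
  <ul>
    <li><a href="/3">Linux Pocket Guide</a></li>
    <li><a href="/4">Linux in a Nutshell</a></li>
  </ul>
</body></html>
"""

paths = learn_paths(TRAIN, ["Results for python", "Learning Python"])
print(scrape(NEW, paths))
```

Matching on bare tag paths is deliberately naive: it breaks as soon as two pages differ in layout, which is the kind of irritating detail a dedicated package is meant to absorb.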
    
