malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

再見小時候 2021-02-14 23:31

I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some iss

5 answers
  •  野趣味 (original poster)
    2021-02-15 00:31

    The issues you are encountering are pretty common, and they come specifically from malformed HTML. In my case, an HTML element had an attribute value wrapped in doubled quotes. I actually ran into this today, and in doing so came across your post. I was FINALLY able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

    First off, you'll need to:

    sudo easy_install bs4
    sudo apt-get install python-html5lib
    

    Then, run this example code:

    from bs4 import BeautifulSoup
    import html5lib
    from html5lib import sanitizer
    from html5lib import treebuilders
    import urllib
    
    url = 'http://the-url-to-scrape'
    fp = urllib.urlopen(url)
    
    # Create an html5lib parser. Not sure if the sanitizer is required.
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
    # Load the source file's HTML into html5lib
    html5lib_object = parser.parse(fp)
    # In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
    html_string = str(html5lib_object)
    
    # Load the string into BeautifulSoup for parsing.
    soup = BeautifulSoup(html_string)
    
    for content in soup.findAll('div'):
        print content
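
For readers on Python 3 with a current bs4, the same idea can probably be expressed more directly, since BeautifulSoup 4 accepts "html5lib" as a parser name. The following is only a minimal sketch under that assumption, not the method used above: SAMPLE_HTML, make_soup and div_texts are hypothetical names made up for illustration, and the fallback to the built-in "html.parser" is an added convenience for machines where html5lib is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: a doubled-quote attribute value and an unclosed <div>.
SAMPLE_HTML = '<div class=""first"">hello</div><div>world'


def make_soup(markup):
    """Parse markup with html5lib if bs4 can find it, else use html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


def div_texts(markup):
    """Return the text content of every <div> in the markup, in order."""
    return [div.get_text() for div in make_soup(markup).find_all("div")]


if __name__ == "__main__":
    for text in div_texts(SAMPLE_HTML):
        print(text)
```

To scrape a live page, the markup passed to make_soup would be the bytes read from the URL (for example via urllib.request.urlopen on Python 3) instead of the sample string.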
    

    If you have any questions about this code or need a little more specific guidance, just let me know. :)
