malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

再見小時候 2021-02-14 23:31

I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some iss

5 Answers
  • 2021-02-15 00:08

    Assuming you are using BeautifulSoup 4, the official documentation addresses this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

    I tried this and it works well, just as @Joshua described:

    soup = BeautifulSoup(r.text, 'html5lib')
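
The version thresholds quoted above can be turned into a small check. This is a minimal sketch, not part of bs4: `builtin_parser_is_reliable` and `pick_parser` are hypothetical helper names, and the preference order (lxml, then html5lib, then the built-in `html.parser`) is an assumption rather than a documented rule.

```python
import importlib.util
import sys


def builtin_parser_is_reliable(version=sys.version_info):
    """Apply the thresholds quoted from the bs4 docs: 2.7.3 and 3.2.2."""
    version = tuple(version[:3])
    if version[0] == 2:
        return version >= (2, 7, 3)
    return version >= (3, 2, 2)


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the name of the first installed third-party parser.

    Falls back to the built-in "html.parser" when none is installed.
    The helper name and the preference order are assumptions.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"
```

The returned name can be passed as the second argument, for example `BeautifulSoup(r.text, pick_parser())`.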
    
  • 2021-02-15 00:14

    Command Line:

    $ pip install beautifulsoup4
    $ pip install html5lib
    

    Python 3:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    
    url = 'http://www.example.com'
    page = urlopen(url)
    soup = BeautifulSoup(page.read(), 'html5lib')
    links = soup.find_all('a', href=True)  # skip anchors that have no href
    
    for link in links:
        print(link.string, link['href'])
    
  • 2021-02-15 00:20

    Look at column 3 of line 100 in the "data" that is mentioned in File "/usr/bin/Sipie/Sipie/Factory.py", line 298
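
One way to follow this advice is to print the reported line with a marker under the reported column. The sketch below assumes the HTML is already available as a string and that both the line and column numbers are 1-based; `show_position` is a hypothetical helper, not something Sipie or BeautifulSoup provides.

```python
def show_position(data, line, column):
    """Return the requested line of `data` plus a caret marking the column.

    Both `line` and `column` are treated as 1-based (an assumption;
    adjust if your error message counts differently).
    """
    lines = data.splitlines()
    if not 1 <= line <= len(lines):
        raise ValueError(
            "line %d is outside the data (%d lines)" % (line, len(lines))
        )
    text = lines[line - 1]
    caret = " " * (column - 1) + "^"
    return text + "\n" + caret


# Made-up sample data, only to show the call.
sample = '<html>\n<a href=""x"">broken</a>\n</html>'
print(show_position(sample, 2, 3))
```

For the error in the question, the call would be `show_position(data, 100, 3)`, with `data` being whatever string Factory.py hands to the parser.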

  • 2021-02-15 00:25

    Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (SGMLParser was removed from the standard library in Python 3.0). As a result, BeautifulSoup can no longer process many malformed HTML documents correctly, which I believe is what you are running into here.

    A likely solution is to uninstall BeautifulSoup and install an older version (which still works with Python 2.6 on Ubuntu 10.04 LTS):

    sudo apt-get remove python-beautifulsoup
    sudo easy_install -U "BeautifulSoup==3.0.7a"
    

    Be aware that this temporary solution will not work with Python 3 (which may become the default in future versions of Ubuntu).

  • 2021-02-15 00:31

    The issues you are encountering are pretty common, and they come down to malformed HTML. In my case, an HTML element had double-quoted an attribute's value. I ran into this today, and in doing so came across your post. I was FINALLY able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

    First off, you'll need to:

    sudo easy_install beautifulsoup4
    sudo apt-get install python-html5lib
    

    Then, run this example code:

    from bs4 import BeautifulSoup
    import html5lib
    from html5lib import sanitizer
    from html5lib import treebuilders
    import urllib
    
    url = 'http://the-url-to-scrape'
    fp = urllib.urlopen(url)
    
    # Create an html5lib parser. Not sure if the sanitizer is required.
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
    # Load the source file's HTML into html5lib
    html5lib_object = parser.parse(fp)
    # In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
    html_string = str(html5lib_object)
    
    # Load the string into BeautifulSoup for parsing.
    soup = BeautifulSoup(html_string)
    
    for content in soup.findAll('div'):
        print content
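
To check whether a page has the doubled-quote problem described in this answer, a rough scan of the raw HTML can help before any parser is involved. The following is a heuristic sketch only: `find_doubled_quotes` and the regular expression are illustrative assumptions, and the pattern can miss cases or flag false positives on unusual markup.

```python
import re

# Matches `=""` immediately followed by a value character, e.g. href=""http...
# An ordinary empty attribute such as alt="" is followed by whitespace,
# ">" or "/" and is therefore not reported.
DOUBLED_QUOTE = re.compile(r'=\s*""(?=[^\s>"/])')


def find_doubled_quotes(html):
    """Return (line, column) pairs, both 1-based, for each suspicious spot."""
    hits = []
    for lineno, text in enumerate(html.splitlines(), start=1):
        for match in DOUBLED_QUOTE.finditer(text):
            hits.append((lineno, match.start() + 1))
    return hits


# Made-up sample data, only to show the call.
sample = '<p>\n<a href=""http://x"">y</a>\n<img alt="">\n'
print(find_doubled_quotes(sample))
```

If the scan reports nothing, the parser error most likely has a different cause, and the line and column from the error message are the better lead.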
    

    If you have any questions about this code or need a little more specific guidance, just let me know. :)
