BeautifulSoup returns unexpected extra spaces

野性不改 2020-12-03 10:48

I am trying to extract some text from HTML documents with BeautifulSoup. In one case that is very relevant for me, it produces a strange and interesting result: after a certain point,

2 answers
  • 2020-12-03 11:26

    You can specify the parser as html.parser:

    soup = BeautifulSoup(prova, 'html.parser')
    

    Alternatively, you can specify the html5lib parser:

    soup = BeautifulSoup(prova, 'html5lib')
    

    If html5lib is not installed yet, install it from a terminal (Debian/Ubuntu package shown; pip install html5lib also works):

    sudo apt-get install python-html5lib
    

    The XML parser can also be used (soup = BeautifulSoup(prova, 'xml'), which requires lxml), but it treats multi-valued attributes such as class="foo bar" as a single string instead of splitting them into a list.
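As a rough sketch of the parser choices above, the helper below tries several parsers in order and falls back when one is not installed. The function name make_soup and the sample markup are made up for illustration; BeautifulSoup raises FeatureNotFound when the requested parser is missing, and html.parser ships with Python, so the last fallback should always be available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Return (soup, parser_name) for the first parser that is installed.

    make_soup is a hypothetical helper, not part of BeautifulSoup.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Illustrative markup only; in the question this would be the downloaded page.
soup, used = make_soup('<p class="foo bar">Hello <b>world</b></p>')
print(used)               # name of the parser that was actually used
print(soup.p.get_text())  # text content of the paragraph
print(soup.p["class"])    # HTML parsers split class into a list
```

Printing the parser name makes it easier to tell which parser produced a given result when comparing outputs across machines.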

  • 2020-12-03 11:33

    I believe this is a bug in lxml's HTML parser. Try:

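    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    # Fetch the raw bytes of the page
    prova = urlopen("http://www.beppegrillo.it").read()
    # Relabel the declared charset before parsing (the workaround)
    soup = BeautifulSoup(prova.replace(b'ISO-8859-1', b'utf-8'), 'lxml')
    print(soup)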
    

    This works around the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it may be worth checking whether you need to upgrade to a newer version.
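A slightly narrower variant of the same workaround, as a sketch only: instead of replacing every occurrence of ISO-8859-1 in the page, it rewrites only the value that follows charset= in the declaration. The function name declare_utf8 and the sample bytes are hypothetical, and like the original workaround it only changes the declared label, not the bytes themselves.

```python
import re

# Matches "charset=ISO-8859-1" (optionally quoted), case-insensitively.
_CHARSET_RE = re.compile(rb'(charset\s*=\s*["\']?)iso-8859-1', re.IGNORECASE)


def declare_utf8(raw_html: bytes) -> bytes:
    """Relabel a declared ISO-8859-1 charset as utf-8; other bytes stay untouched.

    declare_utf8 is an illustrative helper, not part of BeautifulSoup or lxml.
    """
    return _CHARSET_RE.sub(rb'\1utf-8', raw_html)


# Made-up sample standing in for the downloaded page bytes.
sample = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    b'<p>ISO-8859-1 is mentioned here</p>'
)
print(declare_utf8(sample))
```

The result can be passed to BeautifulSoup in place of the prova.replace(...) call; text in the page body that merely mentions ISO-8859-1 is left alone.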

    If you want more information on the bug, it was originally filed here:

    https://bugs.launchpad.net/beautifulsoup/+bug/972466

    Hope this helps,

    Hayden
