html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

青春惊慌失措 2020-12-07 03:14

I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to

1 answer
  • 2020-12-07 03:36

    Something broke in the latest versions of html5lib with regard to bs4: html5lib.treebuilders._base is no longer there. With bs4 4.4.1, the latest compatible version seems to be the one with 7 nines. Once you install it as below, it works fine:

    pip3 install -U html5lib=="0.9999999"
    

    Tested using bs4 4.4.1:

    In [1]: import bs4
    
    In [2]: bs4.__version__
    Out[2]: '4.4.1'
    
    In [3]: import html5lib
    
    In [4]: html5lib.__version__
    Out[4]: '0.9999999'
    
    In [5]: from urllib.request import  urlopen
    
    In [6]: with urlopen("http://example.com/") as f:
       ...:         document = html5lib.parse(f, encoding=f.info().get_content_charset())
       ...:     
    
    In [7]: 
    

    You can see the change in the commit "Rename treebuilders._base to .base to reflect public status", where the module was renamed.
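
    For illustration, code that has to cope with both module names can try the new name first and fall back to the old one. This is only a sketch: `import_first` is a hypothetical helper, not part of html5lib or bs4, and the commented usage assumes html5lib is installed.

```python
import importlib


def import_first(candidates):
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # this name does not exist in the installed version
    raise ImportError("none of these modules could be imported: %r" % (candidates,))


# Hypothetical usage (requires html5lib): prefer the public name,
# fall back to the older private one.
# treebuilder_base = import_first(
#     ["html5lib.treebuilders.base", "html5lib.treebuilders._base"]
# )
```

    This does not fix an old bs4 that imports `_base` itself; for that you still need a matching pair of versions.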

    The error you see occurs because you are still using the newest version: in html5lib/_inputstream.py, HTMLBinaryInputStream no longer takes an encoding argument:

    class HTMLBinaryInputStream(HTMLUnicodeInputStream):
        """Provides a unicode stream of characters to the HTMLTokenizer.
    
        This class takes care of character encoding and removing or replacing
        incorrect byte-sequences and also provides column and line tracking.
    
        """
    
        def __init__(self, source, override_encoding=None, transport_encoding=None,
                     same_origin_parent_encoding=None, likely_encoding=None,
                     default_encoding="windows-1252", useChardet=True):
    

    Setting override_encoding=f.info().get_content_charset() should do the trick.
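
    If the same script has to run against both old and new html5lib, one option is to try the new keyword and fall back to the old one. A minimal sketch: `parse_with_charset` is a hypothetical helper, and the two `*_style_parse` functions are stand-ins for `html5lib.parse` in each version, so the example runs without html5lib installed.

```python
def parse_with_charset(parse, source, charset):
    """Call `parse` with the charset under whichever keyword it accepts."""
    try:
        return parse(source, override_encoding=charset)
    except TypeError as exc:
        # Only swallow the "unexpected keyword argument" case;
        # any other TypeError is a real bug and is re-raised.
        if "unexpected keyword argument" not in str(exc):
            raise
        return parse(source, encoding=charset)


# Stand-ins (assumptions, not the real library functions).
def new_style_parse(source, override_encoding=None):
    return ("override_encoding", override_encoding)


def old_style_parse(source, encoding=None):
    return ("encoding", encoding)


print(parse_with_charset(new_style_parse, b"<p>hi</p>", "utf-8"))
print(parse_with_charset(old_style_parse, b"<p>hi</p>", "utf-8"))
```

    With the real library you would pass `html5lib.parse` as the first argument. Pinning compatible versions is still the simpler fix.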

    Alternatively, upgrading to the latest version of bs4 also works with the latest version of html5lib:

    In [16]: bs4.__version__
    Out[16]: '4.5.1'
    
    In [17]: html5lib.__version__
    Out[17]: '0.999999999'
    
    In [18]: with urlopen("http://example.com/") as f:
       ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
       ....:     
    
    In [19]: 
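
    One thing to keep in mind: `get_content_charset()` returns None when the Content-Type header carries no charset. Since `override_encoding` defaults to None in the signature above, passing None is the same as not passing it. The sketch below shows the header behaviour using `email.message.Message`, which the object returned by `f.info()` subclasses; `charset_from_content_type` is a hypothetical helper for demonstration only.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```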
    