How to fix broken UTF-8 encoding in Python?

感动是毒 2021-02-08 13:48

My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I found a site that can do this: ht

3 Answers
  • 2021-02-08 14:35

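    Try (where mystr is your string):

    mystr.encode('ascii', 'ignore').decode('utf-8')

    This encodes the string as ASCII, ignoring any characters that cannot be encoded, and then decodes the result as UTF-8. It drops every non-ASCII character, so the accents are removed rather than repaired, but it's one approach.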
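To make the effect of this approach concrete, here is a minimal sketch (Python 3). The sample text is taken from another answer in this thread, and the mojibake is produced in the code itself by decoding UTF-8 bytes as Latin-1, which is an assumption about how the string got broken.

```python
# Build a mojibake string by decoding UTF-8 bytes as Latin-1
# (assumed cause of the breakage; illustrative only).
broken = "Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")

# Encoding to ASCII with errors="ignore" silently drops every
# non-ASCII character instead of repairing it.
stripped = broken.encode("ascii", "ignore").decode("utf-8")

print(stripped)
```

The accented vowels disappear entirely, so this is only suitable when losing those characters is acceptable.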

  • 2021-02-08 14:39

    I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works:

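    >>> # Python 2 session: mystr is a UTF-8 byte string holding the mojibake text.
    >>> # In Python 3, mystr is already text, so drop the leading .decode('utf8').
    >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
    >>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
    >>> s
    u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
    >>> print(s)
    09. Bát Nhã Tâm Kinh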
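The same round trip can be sketched for Python 3, where strings are already Unicode text. The helper name fix_mojibake and its parameters are hypothetical, and the sketch assumes the damage came from UTF-8 bytes being decoded as Latin-1. If the round trip is not possible, the input is returned unchanged.

```python
def fix_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one layer of mis-decoding (hypothetical helper, not a library API)."""
    try:
        # Recover the original bytes, then decode them with the intended codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed damage pattern; leave it alone.
        return text


# Simulate the breakage, then repair it.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")
fixed = fix_mojibake(broken)
print(fixed)
```

A string that mixes intact and broken characters, as the one in the question appears to, cannot be encoded to Latin-1 as a whole, so this sketch would return it unchanged.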
    
  • 2021-02-08 14:44

    The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

    This module fixes pretty much everything and works much better than online decoders.

    >>> from ftfy import fix_encoding
    >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
    >>> fix_encoding(mystr)
    '09. Bát Nhã Tâm Kinh'
    

    It can be easily installed using pip install ftfy
