Fixing mojibake in UTF-8 text

陌清茗 2021-01-22 00:25

I have a file with Portuguese text in UTF-8. Whoever produced the file apparently selected the wrong encoding at some point, and the text is full of mojibake:

IDENTIFICAÌàÌÄO ins         


        
1 answer
  • 2021-01-22 00:38

    "AndrÃ©" instead of "André" is what you get when UTF-8 encoded text is decoded as Latin-1. You can fix it by reversing the mistake: encode the text back to Latin-1, then decode the bytes as UTF-8:

    >>> 'AndrÃ©'.encode('latin-1').decode('utf-8')
    'André'
    

    All cases following this pattern can be fixed like that.
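
    For a whole file, the same round trip can be wrapped in a small helper that leaves a line untouched when the round trip fails. This is only a sketch: the names `fix_latin1_mojibake` and `fix_lines` are made up for illustration, and it assumes the damage is exactly "UTF-8 bytes decoded as Latin-1".

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    If the round trip fails, the text does not follow this pattern,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_lines(lines):
    """Apply the fix line by line so one odd line does not block the rest."""
    return [fix_latin1_mojibake(line) for line in lines]
```

    Lines that are already correct (such as "André") normally fail the UTF-8 decoding step and pass through unchanged, but a correct string that happens to form valid UTF-8 after the Latin-1 step would be altered, so it is worth spot-checking the result.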

    However, I can't explain the other case (with "Ìà" for "ç" and "ÌÄ" for "ã"), so I can't offer a fix for it. If you can find a codec in which "Ì", "à", and "Ä" are encoded as the bytes 0xC3, 0xA7, and 0xA3, respectively, you can use it in place of Latin-1 to fix the text.
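
    One way to look for such a codec is to try a list of candidates and keep those that map each character to the wanted byte. The sketch below is an assumption-laden illustration, not a known solution: `find_codecs` and `CANDIDATES` are hypothetical names, the candidate list is an arbitrary sample of single-byte codecs that ship with Python, and the search may well come back empty.

```python
def find_codecs(expected, candidates):
    """Return the candidate codecs in which every character in `expected`
    encodes to exactly the given single byte value."""
    matches = []
    for name in candidates:
        try:
            ok = all(ch.encode(name) == bytes([byte])
                     for ch, byte in expected.items())
        except (UnicodeEncodeError, LookupError):
            # Character not representable in this codec, or unknown codec name.
            ok = False
        if ok:
            matches.append(name)
    return matches


# Arbitrary sample of single-byte codecs; extend as needed.
CANDIDATES = ["latin-1", "cp1252", "cp437", "cp850", "mac_roman",
              "iso8859_15", "cp1250", "cp1251", "koi8_r"]

# Mapping taken from the paragraph above.
wanted = {"Ì": 0xC3, "à": 0xA7, "Ä": 0xA3}
print(find_codecs(wanted, CANDIDATES))
```

    Any codec name it returns could then replace 'latin-1' in the encode/decode round trip.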
