Unbaking mojibake

流过昼夜 提交于 2019-11-29 08:39:54
galinden

You could use chardet (install with pip):

import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")

Result: 時間試験観点(アニメパス)_10秒 (no idea if this could be correct or not)

For Python 3 (source file encoded as utf8):

import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî¦è¤ô_üiâAâjâüâpâXüj_10òb"

try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except UnicodeEncodeError:
        print("could not decode encoded_str as %s" % detected_encoding)

    with codecs.open("output.txt", "w", "utf-8-sig") as out:
        out.write(correct_str)

In summary:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!