How to determine the encoding of text?

一向 2020-11-21 07:47

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the enc

10 answers
  •  南笙 (original poster)
    2020-11-21 08:11

    This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. To read a file into a byte array (data), see http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example that checks whether a file is valid UTF-8. I'm on macOS.

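    #!/usr/bin/python

    import sys

    def isUTF8(data):
        """Return True if the bytes in data decode as valid UTF-8."""
        try:
            decoded = data.decode('UTF-8')
        except UnicodeDecodeError:
            return False
        else:
            # Python 2's decoder lets encoded surrogates (U+D800-U+DFFF) through,
            # which are not valid in UTF-8; Python 3 already rejects them above.
            for ch in decoded:
                if 0xD800 <= ord(ch) <= 0xDFFF:
                    return False
            return True

    def get_bytes_from_file(filename):
        # Read the whole file as raw bytes; the with block closes the handle.
        with open(filename, "rb") as f:
            return f.read()

    filename = sys.argv[1]
    data = get_bytes_from_file(filename)
    result = isUTF8(data)
    print(result)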
    
    
    PS /Users/js> ./isutf8.py hi.txt                                                                                     
    True
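
The same idea can be extended to the other checks mentioned above (BOMs and plain ASCII). The following is a minimal sketch, not the code from the linked page: the function name `guess_encoding` and the `_BOMS` table are made up for illustration, and it assumes Python 3. It can only confirm that the bytes are consistent with an encoding. For example, it cannot tell Latin-1 from other single-byte charsets, and UTF-16 LE text whose first character is NUL would be reported as UTF-32.

```python
import codecs

# BOM signatures, checked longest first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(data):
    """Return a codec name that fits data (bytes), or None if nothing matches.

    Hypothetical helper for illustration; it only covers BOM-marked
    encodings, ASCII, and UTF-8 without a BOM.
    """
    # 1. A BOM at the start of the data is the strongest hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Pure 7-bit data is ASCII (and therefore also valid UTF-8).
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise, see whether the bytes form valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

The returned name can be passed straight to `data.decode(name)`; the `utf-8-sig`, `utf-16`, and `utf-32` codecs consume the BOM while decoding. A result of `None` means the bytes are in some other encoding that these checks cannot identify.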
    
