How to determine the encoding of text?

一向 2020-11-21 07:47

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the encoding of a text file?

10 Answers
  • 2020-11-21 07:59

    If you know some of the content of the file, you can try to decode it with several encodings and see which one makes sense. In general there is no sure way, since a text file is just a text file, and those are stupid ;)
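
    The trial-decoding idea above can be sketched as follows. The candidate list here is an assumption; adjust it for your data. Note that single-byte encodings such as ISO-8859-1 accept any byte sequence, so a successful decode is a hint, not proof:

```python
# Sketch: try a few candidate encodings (a hypothetical list) and
# keep those that decode the bytes without errors.
CANDIDATES = ['utf-8', 'iso-8859-1', 'cp1252', 'utf-16']

def plausible_encodings(data: bytes):
    ok = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
        ok.append(enc)
    return ok

# iso-8859-1 and cp1252 decode almost anything, so they show up too
print(plausible_encodings('déjà vu'.encode('utf-8')))
# ['utf-8', 'iso-8859-1', 'cp1252']
```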

  • 2020-11-21 08:01

    Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

    The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. It can determine the encoding of a file by doing:

    import magic
    
    # Read the raw bytes and ask libmagic to guess the MIME encoding
    with open('unknown-file', 'rb') as f:
        blob = f.read()
    m = magic.open(magic.MAGIC_MIME_ENCODING)
    m.load()
    encoding = m.buffer(blob)  # "utf-8", "us-ascii", etc.
    

    There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:

    import magic
    
    # Same idea, using the PyPI python-magic API
    with open('unknown-file', 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)
    
  • 2020-11-21 08:07

    Depending on your platform, you can opt to use the Linux shell file command. This works for me since I am using it in a script that exclusively runs on one of our Linux machines.

    Obviously this isn't an ideal solution or answer, but it can be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

    import subprocess
    
    def is_utf8(filename):
        # `file` prints something like "test.txt: UTF-8 Unicode text"
        p = subprocess.run(['file', filename], stdout=subprocess.PIPE)
        # The captured output is bytes, so decode it before parsing
        file_type = p.stdout.decode().split(': ')[1]
        return file_type.startswith('UTF-8')
    
    print(is_utf8('test.txt'))
    
  • 2020-11-21 08:08

    This might be helpful:

    from bs4 import UnicodeDammit
    
    with open('automate_data/billboard.csv', 'rb') as file:
        content = file.read()
    
    suggestion = UnicodeDammit(content)
    print(suggestion.original_encoding)
    # 'iso-8859-1'
    
  • 2020-11-21 08:09

    Correctly detecting the encoding all of the time is impossible.

    (From the chardet FAQ:)

    However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

    The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.

    You can also use UnicodeDammit. It will try the following methods:

    • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
    • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
    • An encoding sniffed by the chardet library, if you have it installed.
    • UTF-8
    • Windows-1252
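
    A minimal chardet sketch: feed it some raw bytes and inspect its guess. The sample text is just an assumption for illustration; real detection needs a reasonable amount of input to be reliable.

```python
import chardet

# chardet.detect returns a dict with its best guess and a confidence score
raw = 'Ceci est déjà du texte, non aléatoire, en français.'.encode('utf-8')
guess = chardet.detect(raw)
print(guess['encoding'], guess['confidence'])
```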
  • 2020-11-21 08:11

    This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example; I'm on macOS.

    #!/usr/bin/python
    
    import sys
    
    def isUTF8(data):
        try:
            decoded = data.decode('UTF-8')
        except UnicodeDecodeError:
            return False
        else:
            # Reject surrogate code points, which are not valid in UTF-8 text
            for ch in decoded:
                if 0xD800 <= ord(ch) <= 0xDFFF:
                    return False
            return True
    
    def get_bytes_from_file(filename):
        with open(filename, "rb") as f:
            return f.read()
    
    filename = sys.argv[1]
    data = get_bytes_from_file(filename)
    result = isUTF8(data)
    print(result)
    
    
    PS /Users/js> ./isutf8.py hi.txt                                                                                     
    True
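
    The BOM-based part of the detection that the linked page describes can be sketched with the stdlib codecs constants. This is an assumption-level sketch, not the page's code; the UTF-32 marks must be tested before the UTF-16 ones, because the UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM:

```python
import codecs

# Check the common byte-order marks at the start of a byte string.
# Order matters: BOM_UTF32_LE (FF FE 00 00) begins with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: could still be UTF-8, ASCII, or anything else

print(sniff_bom(codecs.BOM_UTF8 + b'hello'))
# utf-8-sig
```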
    