How to determine the encoding of text?

一向 2020-11-21 07:47

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the enc

10 Answers
  • 2020-11-21 08:13

    Some encoding strategies; uncomment to taste:

    #!/bin/bash
    #
    # Quote "$tmpfile" so paths containing spaces survive word splitting
    tmpfile=$1
    echo '-- info about file ........'
    file -i "$tmpfile"
    enca -g "$tmpfile"
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    recode CP1250..UTF-8 "$tmpfile"
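
    If the shell tools are not available, the recoding step can be approximated in Python. This is only a sketch: the name transcode and its default encodings are assumptions for illustration, and unlike recode it writes to a separate output file instead of converting in place.

```python
def transcode(src_path, dst_path, from_enc='cp1250', to_enc='utf-8'):
    """Copy a text file, decoding with from_enc and re-encoding with to_enc.

    Hypothetical helper. newline='' keeps the original line endings untouched.
    """
    with open(src_path, 'r', encoding=from_enc, newline='') as src, \
         open(dst_path, 'w', encoding=to_enc, newline='') as dst:
        for line in src:
            dst.write(line)
```

    A wrong from_enc either raises UnicodeDecodeError or silently produces mojibake, so this only helps once the source encoding is known.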
    

    You could also check the encoding by opening and reading the file in a loop over candidate encodings, though you may want to check the file size first:

    import codecs

    # Candidate encodings, strictest first; single-byte code pages accept
    # almost any input, so they belong at the end of the list. ...etc
    encodings = ['utf-8', 'windows-1250', 'windows-1252']
    for e in encodings:
        try:
            fh = codecs.open('file.txt', 'r', encoding=e)
            fh.readlines()
            fh.seek(0)
        except UnicodeDecodeError:
            print('got unicode error with %s, trying different encoding' % e)
        else:
            print('opening the file with encoding: %s' % e)
            break
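
    On the file size point: the same trial-and-error idea can be done in chunks, so a large file is never held in memory at once. A rough sketch using the standard library's incremental decoders; the function name and the chunk size are assumptions, not something from the snippet above.

```python
import codecs

def first_working_encoding(path, candidates, chunk_size=64 * 1024):
    """Return the first candidate that decodes the whole file, else None."""
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors='strict')
        try:
            with open(path, 'rb') as fh:
                # Feed the file to the decoder one chunk at a time.
                for chunk in iter(lambda: fh.read(chunk_size), b''):
                    decoder.decode(chunk)
                # final=True reports a multi-byte sequence cut off at EOF.
                decoder.decode(b'', final=True)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

    Keep in mind that "decodes without error" is not the same as "is the right encoding": the result only says which candidate was the first not to fail.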
    
  • 2020-11-21 08:13

    It is, in principle, impossible to determine the encoding of a text file, in the general case. So no, there is no standard Python library to do that for you.

    If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
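
    To illustrate both points with a small sketch (the sample bytes are made up for this example): a single byte decodes cleanly under several code pages to different characters, so the bytes alone cannot settle the question, whereas an XML document names its own encoding and the standard parser honours it.

```python
import xml.etree.ElementTree as ET

raw = b'\xe9'  # one byte, no metadata about how it was encoded

# Each of these single-byte codecs accepts the byte but yields different text.
readings = {name: raw.decode(name) for name in ('cp1252', 'cp1251', 'iso-8859-7')}
print(readings)

# XML is the easier case: the declaration states the encoding for the parser.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><word>caf\xe9</word>'
print(ET.fromstring(doc).text)
```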

  • 2020-11-21 08:15
    # Function: OpenRead(file)
    
    # A text file can be encoded using:
    #   (1) the default operating system code page, or
    #   (2) utf8 with a BOM header
    #
    #  If a text file is encoded with utf8 and does not have a BOM header,
    #  the user can manually add a BOM header to the text file
    #  using a text editor such as notepad++, and rerun the python script;
    #  otherwise the file is read as a code page file with the
    #  invalid code page characters removed
    
    import sys
    if sys.version_info[0] != 3:
        print('Aborted: Python 3.x required')
        sys.exit(1)
    
    def bomType(file):
        """
        returns file encoding string for open() function
    
        EXAMPLE:
            bom = bomType(file)
            open(file, encoding=bom, errors='ignore')
        """
    
        with open(file, 'rb') as f:
            b = f.read(4)
    
        # Check utf-32 first: its little-endian BOM (ff fe 00 00) starts
        # with the same two bytes as the utf-16 little-endian BOM
        if b[0:4] in (b'\xff\xfe\x00\x00', b'\x00\x00\xfe\xff'):
            return "utf32"
    
        # "utf-8-sig" strips the BOM on read; plain "utf8" would leave
        # it in the text as a leading '\ufeff'
        if b[0:3] == b'\xef\xbb\xbf':
            return "utf-8-sig"
    
        # Python automatically detects endianness if utf-16 bom is present
        # write endianness generally determined by endianness of CPU
        if b[0:2] in (b'\xfe\xff', b'\xff\xfe'):
            return "utf16"
    
        # If BOM is not provided, then assume it's the code page
        #     used by your operating system
        return "cp1252"
        # For the United States it's: cp1252
    
    
    def OpenRead(file):
        bom = bomType(file)
        return open(file, 'r', encoding=bom, errors='ignore')
    
    
    #######################
    # Testing it
    #######################
    fout = open("myfile1.txt", "w", encoding="cp1252")
    fout.write("* hi there (cp1252)")
    fout.close()
    
    # "utf-8-sig" writes the BOM that bomType() looks for; plain "utf8" writes none
    fout = open("myfile2.txt", "w", encoding="utf-8-sig")
    fout.write("\u2022 hi there (utf8)")
    fout.close()
    
    # this case is still treated like codepage cp1252
    #   (User responsible for making sure that all utf8 files
    #   have a BOM header)
    fout = open("badboy.txt", "wb")
    fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
    fout.close()
    
    # Read example file with BOM detection
    fin = OpenRead("myfile1.txt")
    L = fin.readline()
    print(L)
    fin.close()
    
    # Read example file with BOM detection
    fin = OpenRead("myfile2.txt")
    L = fin.readline()
    print(L)  # requires QtConsole to view, Cmd.exe is cp1252
    fin.close()
    
    # Read CP1252 with a few undefined chars without barfing
    fin = OpenRead("badboy.txt")
    L = fin.readline()
    print(L)
    fin.close()
    
    # Check that bad characters are still in badboy code page file
    fin = open("badboy.txt", "rb")
    print(fin.read(20))
    fin.close()
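
    For comparison, the same BOM check can be written against the constants in the standard codecs module instead of hand-typed byte strings. This is a sketch, not a drop-in replacement: sniff_bom and its default argument are names made up here, and it works on bytes you have already read.

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE begins with the same two bytes
# as BOM_UTF16_LE, so checking utf-16 first would misreport utf-32.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)

def sniff_bom(raw, default=None):
    """Return a codec name based on the BOM at the start of raw, else default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default
```

    The returned names ('utf-8-sig', 'utf-16', 'utf-32') all consume the BOM when decoding, so the text does not start with a stray '\ufeff'.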
    
  • 2020-11-21 08:21

    Here is an example of reading and taking at face value a chardet encoding prediction, reading n_lines from the file in the event it is large.

    chardet also gives you a probability (i.e. confidence) of its encoding prediction (I haven't looked into how they come up with that), which is returned alongside the prediction from chardet.detect(), so you could work that in somehow if you like.

    def predict_encoding(file_path, n_lines=20):
        '''Predict a file's encoding using chardet'''
        import chardet
    
        # Open the file as binary data
        with open(file_path, 'rb') as f:
            # Join binary lines for specified number of lines
            rawdata = b''.join([f.readline() for _ in range(n_lines)])
    
        return chardet.detect(rawdata)['encoding']
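
    One way to work the confidence in, as a sketch: chardet.detect() returns a dict with 'encoding' and 'confidence' keys, so a small helper can fall back to a default when the guess is weak. The helper name, the 0.8 threshold and the 'utf-8' fallback are arbitrary assumptions to tune for your data.

```python
def pick_encoding(result, min_confidence=0.8, fallback='utf-8'):
    """Return result['encoding'] if confident enough, otherwise fallback.

    result is a dict shaped like the return value of chardet.detect().
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # chardet reports encoding=None when it cannot make any guess.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding
```

    Inside predict_encoding you would then return pick_encoding(chardet.detect(rawdata)) instead of indexing ['encoding'] directly.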
    