Python - Decode UTF-16 file with BOM

孤城傲影 2020-12-29 09:58

I have a UTF-16 LE file with a BOM. I'd like to convert this file to UTF-8 without a BOM so I can parse it using Python.

The usual code that I use didn't do the trick.

2 Answers
  •  孤城傲影
    2020-12-29 10:45

    Firstly, you should read in binary mode, otherwise things will get confusing.

    Then, check for and remove the BOM, since it is part of the file, but not part of the actual text.

    import codecs

    # Read in binary mode so the BOM bytes arrive untouched.
    with open('dbo.chrRaces.Table.sql', 'rb') as f:
        encoded_text = f.read()

    bom = codecs.BOM_UTF16_LE  # see dir(codecs) for the other BOM constants
    # Make sure the encoding is what you expect; otherwise you'll get wrong data.
    assert encoded_text.startswith(bom)
    encoded_text = encoded_text[len(bom):]          # strip away the BOM
    decoded_text = encoded_text.decode('utf-16le')  # decode to a unicode string
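
    To see why the BOM has to be stripped by hand here, the following sketch compares two codecs on a made-up byte string (the sample data and variable names are illustrative only). With 'utf-16-le' the byte order is fixed, so the BOM is decoded as an ordinary character; with 'utf-16' the BOM is read to pick the byte order and is then dropped.

```python
import codecs

# Made-up sample: a BOM followed by UTF-16 LE encoded text.
raw = codecs.BOM_UTF16_LE + "id".encode("utf-16-le")

# "utf-16-le" fixes the byte order and treats the BOM as ordinary data,
# so it survives as U+FEFF at the start of the string.
kept = raw.decode("utf-16-le")

# "utf-16" uses the BOM to detect the byte order and then discards it.
dropped = raw.decode("utf-16")
```

    Stripping the BOM explicitly, as in the answer's code, has the added benefit that the assert catches a file that is not in the expected encoding.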
    

    Don't encode (to utf-8 or otherwise) until you're done with all parsing/processing. You should do all that using unicode strings.
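
    A minimal sketch of that order of operations (decode once, work on text, encode once at the end) might look like this. The helper name `utf16le_bom_to_utf8`, the line-ending step standing in for real processing, and the sample SQL text are all hypothetical.

```python
import codecs


def utf16le_bom_to_utf8(raw: bytes) -> bytes:
    """Convert UTF-16 LE bytes with a BOM into UTF-8 bytes without a BOM.

    Hypothetical helper for illustration; not part of any library.
    """
    bom = codecs.BOM_UTF16_LE
    if not raw.startswith(bom):
        raise ValueError("input does not start with a UTF-16 LE BOM")
    text = raw[len(bom):].decode("utf-16-le")  # bytes -> str, BOM removed

    # All parsing/processing happens here, on the str object.
    # As a stand-in for real processing, normalize Windows line endings.
    text = text.replace("\r\n", "\n")

    # Encode only at the very end. The "utf-8" codec does not write a BOM
    # (the "utf-8-sig" codec is the one that does).
    return text.encode("utf-8")


# Made-up sample input: a BOM followed by UTF-16 LE encoded text.
sample = codecs.BOM_UTF16_LE + "SELECT 'é';\r\n".encode("utf-16-le")
converted = utf16le_bom_to_utf8(sample)
```

    Raising ValueError instead of using assert is a design choice in this sketch: assertions are skipped when Python runs with -O, while an explicit check always runs.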

    Also, errors='ignore' on decode may be a bad idea. Consider what's worse: having your program tell you something is wrong and stop, or returning wrong data?
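
    The difference can be seen on a small made-up input, here two valid UTF-16 LE characters followed by one stray byte, as a truncated file might contain. This is only a sketch; the variable names are illustrative.

```python
# Made-up input: valid UTF-16 LE for "ab", then one stray trailing byte.
damaged = "ab".encode("utf-16-le") + b"\x63"

# errors="ignore" silently drops what it cannot decode.
lossy = damaged.decode("utf-16-le", errors="ignore")

# The default (errors="strict") raises, so the problem gets noticed.
try:
    damaged.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

    With 'ignore' the caller receives a shorter string and no hint that data was lost; with the default the program stops at the point where the input is wrong.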
