UTF8 Python BOM [duplicate]

Question

Possible Duplicate:
Write to utf-8 file in python

I have Unicode strings (with Japanese characters) that I want to write to a CSV file. However, the BOM does not seem to be written correctly; it shows up as the literal string "ï»¿" at the start of the first line. As a result, Excel does not display the Japanese characters correctly, although Notepad++ displays them correctly when opening the same CSV.

import codecs

fileObj = codecs.open(filename, "w", 'utf-8')
fileObj.write(codecs.BOM_UTF8)
c = u';'
for s in stringsToWrite:
    line = s.someUnicodeString
    fileObj.write(line)
fileObj.close()

Answer 1:

fileObj = codecs.open(filename,"w",'utf-8')

OK, you have a Unicode output stream.

fileObj.write(codecs.BOM_UTF8)

BOM_UTF8 is a sequence of bytes, not a Unicode string, which is what a Unicode stream expects to be given. Python will implicitly convert the bytes to Unicode using the default encoding, which may not be the correct one. If the default encoding is Windows code page 1252 rather than UTF-8, you are effectively double-encoding the BOM, and it comes out as the UTF-8 encoding of ï»¿.
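To make the double-encoding concrete, here is a minimal sketch that reproduces it by hand. It assumes the implicit conversion behaves like an explicit decode with code page 1252; the variable names are illustrative only.

```python
import codecs

# The UTF-8 BOM as raw bytes: EF BB BF
bom_bytes = codecs.BOM_UTF8

# Misreading those three bytes as cp1252 yields three separate characters
misread = bom_bytes.decode('cp1252')

# Encoding those three characters as UTF-8 gives six bytes instead of three
double_encoded = misread.encode('utf-8')

print(repr(misread))         # the three characters ï » ¿
print(repr(double_encoded))  # six bytes, no longer a valid BOM
```

A file that starts with those six bytes has no BOM as far as a reader is concerned, which matches the "ï»¿" seen at the start of the first line.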

Suggest writing the BOM as the Unicode character it is instead:

fileObj.write(u'\uFEFF')
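A fuller sketch of that suggestion follows. It uses io.open rather than codecs.open, which is a substitution on my part; the helper name write_lines_with_bom, the temporary file, and the sample rows are all hypothetical and only serve to make the example runnable.

```python
import io
import os
import tempfile


def write_lines_with_bom(path, lines):
    """Write Unicode lines to path as UTF-8, preceded by a single BOM."""
    # newline='' disables newline translation so the explicit \r\n is kept
    with io.open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(u'\uFEFF')  # the BOM as a Unicode character, encoded by the stream
        for line in lines:
            f.write(line + u'\r\n')


# Hypothetical sample data with Japanese text and ';' as the separator
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_lines_with_bom(path, [u'名前;値', u'東京;1'])

# Read the file back as bytes to check what actually landed on disk
with open(path, 'rb') as f:
    raw = f.read()

print(raw[:3] == b'\xef\xbb\xbf')
```

Python also ships a 'utf-8-sig' codec that emits the BOM on its own when writing, so passing encoding='utf-8-sig' and dropping the explicit write of u'\uFEFF' should give the same bytes.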

InternetSeriousBusiness wrote:

Isn't the UTF-8 BOM discouraged, anyway?

Yes, the UTF-8 faux-BOM is largely a disaster in most contexts, but it is needed to get Excel's charset guessing to pick up UTF-8. Unfortunately it doesn't work in Excel for Mac. Another possible approach might be to use UTF-16.
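For the UTF-16 route, a minimal sketch of what Python produces is shown below. It only demonstrates the encoding step; whether a given Excel version then opens the result correctly is not something this example verifies, and the sample text is made up.

```python
import codecs

# Hypothetical sample row
text = u'東京;1\r\n'

# The plain 'utf-16' codec prepends a BOM in the platform's native byte order
data = text.encode('utf-16')

# Either the little-endian or the big-endian BOM, depending on the machine
has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf-16' consumes the BOM and restores the original text
round_trip = data.decode('utf-16')

print(has_bom, round_trip == text)
```

Here the BOM is a genuine byte-order mark, since UTF-16 does have two possible byte orders.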

Answer 2:

The string you copied is the UTF-8 BOM. So your problem is not in your Python code but somewhere else.

Source: https://stackoverflow.com/questions/12180376/utf8-python-bom

Tags

python

unicode

utf-8

byte-order-mark