Converting UTF-16 to UTF-8 and removing the BOM

Anonymous (unverified), submitted 2019-12-03 01:52:01

Question:

We have a data entry person who saved files as UTF-16 on Windows. We would like to convert them to UTF-8 and remove the BOM. The UTF-8 conversion works, but the BOM is still there. How would I remove it? This is what I currently have:

    import codecs
    import os

    batch_3 = {'src': '/Users/jt/src', 'dest': '/Users/jt/dest/'}
    batches = [batch_3]

    for b in batches:
        s_files = os.listdir(b['src'])
        for file_name in s_files:
            ff_name = os.path.join(b['src'], file_name)
            if os.path.isfile(ff_name) and ff_name.endswith('.json'):
                print(ff_name)
                target_file_name = os.path.join(b['dest'], file_name)
                BLOCKSIZE = 1048576
                with codecs.open(ff_name, "r", "utf-16-le") as source_file:
                    with codecs.open(target_file_name, "w+", "utf-8") as target_file:
                        while True:
                            contents = source_file.read(BLOCKSIZE)
                            if not contents:
                                break
                            target_file.write(contents)

If I run hexdump -C, I see:

    Wed Jan 11$ hexdump -C svy-m-317.json
    00000000  ef bb bf 7b 0d 0a 20 20  20 20 22 6e 61 6d 65 22  |...{..    "name"|
    00000010  3a 22 53 61 76 6f 72 79  20 4d 61 6c 69 62 75 2d  |:"Savory Malibu-|

in the resulting file. How do I remove the BOM?

thx

Answer 1:

Just use str.decode and str.encode:

    with open(ff_name, 'rb') as source_file:
        with open(target_file_name, 'w+b') as dest_file:
            contents = source_file.read()
            dest_file.write(contents.decode('utf-16').encode('utf-8'))

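Decoding with the 'utf-16' codec gets rid of the BOM for you and uses it to deduce the endianness. (In Python 2 this is str.decode; in Python 3 the same method is bytes.decode.)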
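The snippet above reads the whole file into memory. If the files are large, the block-wise loop from the question can be kept and only the codec changed. The following is a minimal sketch assuming Python 3; the function name convert_utf16_to_utf8 is made up for illustration, and it assumes every source file starts with a BOM.

```python
def convert_utf16_to_utf8(src_path, dest_path, blocksize=1048576):
    """Copy src_path (UTF-16 with BOM) to dest_path as UTF-8 without a BOM."""
    # encoding="utf-16" reads the BOM to pick the byte order, then drops it.
    # newline="" turns off newline translation, so CRLF line endings are kept.
    with open(src_path, "r", encoding="utf-16", newline="") as source_file, \
         open(dest_path, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(blocksize)  # up to blocksize characters
            if not contents:
                break
            target_file.write(contents)
```

Inside the question's loop this would be called as convert_utf16_to_utf8(ff_name, target_file_name). The plain "utf-8" codec never writes a BOM on its own, so the output starts directly with the first character of the text.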



Answer 2:

This is the difference between UTF-16LE and UTF-16:

  • UTF-16LE is little endian without a BOM
  • UTF-16 is big or little endian with a BOM

So when you use UTF-16LE, the BOM is just part of the text. Use UTF-16 instead, so the BOM is automatically removed. The reason UTF-16LE and UTF-16BE exist is so people can carry around "properly-encoded" text without BOMs, which does not apply to you.

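Note what happens when you encode using one encoding and decode using the other. (When there is no BOM, Python's UTF-16 decoder falls back to the machine's native byte order, so it decodes BOM-less UTF-16LE correctly on little-endian machines but not on big-endian ones.)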

    >>> u'Hello, world'.encode('UTF-16LE')
    'H\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
    >>> u'Hello, world'.encode('UTF-16')
    '\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00'
     ^^^^^^^^ (BOM)

    >>> u'Hello, world'.encode('UTF-16LE').decode('UTF-16')
    u'Hello, world'
    >>> u'Hello, world'.encode('UTF-16').decode('UTF-16LE')
    u'\ufeffHello, world'
      ^^^^^^ (BOM)
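The session above is Python 2. As a rough Python 3 equivalent, the sketch below also shows where the ef bb bf in the question's hexdump comes from: the BOM character that UTF-16LE leaves in the text is re-encoded as three bytes in UTF-8. The sample bytes are made up for illustration.

```python
import codecs

# Bytes as a Windows tool would typically save them:
# a little-endian BOM followed by UTF-16LE text.
raw = codecs.BOM_UTF16_LE + '{"name"'.encode("utf-16-le")

kept = raw.decode("utf-16-le")    # BOM survives as the character U+FEFF
stripped = raw.decode("utf-16")   # BOM is consumed while detecting byte order

print(repr(kept[0]))                                        # '\ufeff'
print(stripped)                                             # {"name"
print(kept.encode("utf-8").startswith(codecs.BOM_UTF8))     # True
print(stripped.encode("utf-8").startswith(codecs.BOM_UTF8)) # False
```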

Or you can do this at the shell:

    for x in * ; do iconv -f UTF-16 -t UTF-8 "$x" > "$x.tmp" && mv "$x.tmp" "$x"; done
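For comparison, the same convert-to-a-temporary-file-then-rename pattern can be sketched in Python 3. The function names convert_in_place and convert_directory are hypothetical, and the sketch assumes every matching file is still UTF-16 with a BOM.

```python
import os

def convert_in_place(path):
    """Rewrite one UTF-16 file (with BOM) as UTF-8 without a BOM."""
    with open(path, "rb") as f:
        text = f.read().decode("utf-16")  # the BOM is consumed here
    tmp_path = path + ".tmp"              # same temporary name as the shell loop
    with open(tmp_path, "wb") as f:
        f.write(text.encode("utf-8"))
    os.replace(tmp_path, path)            # only reached if the conversion succeeded

def convert_directory(directory, suffix=".json"):
    """Convert every regular file in directory whose name ends with suffix."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.endswith(suffix):
            convert_in_place(path)
```

Like the shell loop, this should be run only once per file: a second pass would try to decode the UTF-8 output as UTF-16.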

