Creating UTF-16 newline characters in Python for Windows Notepad

Submitted by 可紊 on 2019-12-11 11:40:15

Question


In Python 2.7 running on Ubuntu, this code:

f = open("testfile.txt", "w")
f.write("Line one".encode("utf-16"))
f.write(u"\r\n".encode("utf-16"))
f.write("Line two".encode("utf-16"))

produces the desired newline between the two lines of text when read in Gedit:

Line one
Line two

However, the same code executed on Windows 7 and read in Notepad produces unintelligible characters after "Line one", and Notepad does not recognize any newline. How can I write correct newline characters for UTF-16 on Windows to match the output I get on Ubuntu?

I am writing output for a Windows-only application that reads only UTF-16. I've spent hours trying out different tips, but nothing seems to work for Notepad. It's worth mentioning that I can successfully convert a text file to UTF-16 right in Notepad, but I'd rather have the script save the encoding correctly in the first place.


Answer 1:


The problem is that you're opening the file in text mode, but trying to use it as a binary file.

This:

u"\r\n".encode("utf-16")

… encodes to '\r\0\n\0' (ignoring the BOM for now).

Then this:

f.write('\r\0\n\0')

… goes through text mode on Windows, which converts the Unix newline byte to a Windows newline, giving '\r\0\r\n\0'.

And that, of course, breaks your UTF-16 encoding. Besides the fact that the two bytes \r\n will decode into the valid but unassigned codepoint U+0A0D, that's an odd number of bytes, meaning you've got a leftover \0. So, instead of L\0 being the next character, it's \0L, which decodes as U+4C00, and so on for every character after it.

On top of that, each call to encode("utf-16") prepends its own UTF-16 BOM, so you're writing a new BOM for each encoded string. Most Windows apps will transparently ignore the extra ones, so in practice all you're doing is wasting two bytes per write, but it isn't actually correct.
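To make the diagnosis concrete, here is a minimal sketch that reproduces the corruption in memory. It assumes little-endian UTF-16 and stands in for Windows text mode with a plain bytes.replace call, so it is only an approximation of what the file object does; the variable names are made up for illustration.

```python
# Bytes for the newline, little-endian and without a BOM so they are easy to read.
encoded = u"\r\n".encode("utf-16-le")            # b'\r\x00\n\x00'

# Stand-in for Windows text mode, which rewrites every b'\n' as b'\r\n' on write.
translated = encoded.replace(b"\n", b"\r\n")     # b'\r\x00\r\n\x00'

# Append the first character of "Line two" and decode the first three code units.
stream = translated + u"L".encode("utf-16-le")   # 7 bytes: an odd length
decoded = stream[:6].decode("utf-16-le")         # u'\r\u0a0d\u4c00'

# Separately: every .encode("utf-16") call prepends its own BOM.
boms = (b"\xff\xfe", b"\xfe\xff")
pieces = (u"Line one", u"\r\n", u"Line two")
bom_count = sum(s.encode("utf-16")[:2] in boms for s in pieces)
```

The extra \r shifts every following byte by one position, which is why the rest of the file turns into unrelated characters rather than just showing one bad newline.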


The quick fix to the first problem is to open the file in binary mode:

f = open("testfile.txt", "wb")

This doesn't fix the multiple-BOM problem, but it fixes the broken \n problem. If you want to fix the BOM problem, you can either use a stateful encoder or explicitly specify 'utf-16-le' (or 'utf-16-be') for every write after the first.
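As a rough sketch of the stateful option, one incremental encoder from the codecs module can be reused for every write, so the BOM is emitted only once. The temporary path and the variable names here are assumptions made for the example, not part of the original question.

```python
import codecs
import os
import tempfile

# Hypothetical scratch path, used only so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")

# One encoder object for the whole file: it emits the BOM on its first call only.
encoder = codecs.getincrementalencoder("utf-16")()

with open(path, "wb") as f:          # binary mode: bytes are written untouched
    for chunk in (u"Line one", u"\r\n", u"Line two"):
        f.write(encoder.encode(chunk))

# Read the raw bytes back to check what actually landed on disk.
with open(path, "rb") as f:
    data = f.read()

text = data.decode("utf-16")         # the decoder consumes the single leading BOM
```

If a second BOM had been written, it would show up in text as a stray U+FEFF character, so comparing text against the expected string is a simple way to confirm there is only one.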


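But the easy fix, for both problems, is to use the io module (or, for older Python 2.x, the codecs module) to do all the hard work for you. In Python 2, files opened with io.open in text mode accept only unicode strings, and passing newline="" stops the file from expanding the \n in your explicit \r\n a second time on Windows:

import io

# newline="" turns off newline translation, so the explicit u"\r\n" is written unchanged
f = io.open("testfile.txt", "w", encoding="utf-16", newline="")
f.write(u"Line one")
f.write(u"\r\n")
f.write(u"Line two")
f.close()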
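A possible variation, shown here only as a sketch: let io do the newline translation by passing newline="\r\n" and writing plain u"\n", which should produce CRLF on any platform. The scratch path and variable names are assumptions for the example; reading the bytes back is just a sanity check.

```python
import io
import os
import tempfile

# Hypothetical scratch path so the sketch does not overwrite a real file.
out_path = os.path.join(tempfile.mkdtemp(), "testfile.txt")

# newline="\r\n" makes io translate each written u"\n" to CRLF, regardless of OS.
with io.open(out_path, "w", encoding="utf-16", newline="\r\n") as f:
    f.write(u"Line one\n")
    f.write(u"Line two")

# Read the raw bytes back and decode them to see what Notepad would receive.
with io.open(out_path, "rb") as f:
    raw = f.read()

roundtrip = raw.decode("utf-16")
```

Because the text layer encodes the whole stream with one encoder, the file should start with a single BOM no matter how many write calls are made.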


Source: https://stackoverflow.com/questions/17159236/creating-utf-16-newline-characters-in-python-for-windows-notepad
