Python(2.6) cStringIO unicode support?

Submitted by 谁说胖子不能爱 on 2020-01-02 06:33:13

Question


I'm using Python's pycurl module to download content from various web pages. Since I also want to support Unicode text, I've been avoiding cStringIO.StringIO, because the Python docs for "cStringIO - Faster version of StringIO" say:

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

So the docs do not say that Unicode strings are unsupported outright, only those that cannot be encoded as ASCII. Can someone please clarify this for me? Which strings can be converted and which cannot?

I've tested with the following code and it seems to work just fine with unicode:

import pycurl
import cStringIO

downloadedContent = cStringIO.StringIO()
curlHandle = pycurl.Curl()
curlHandle.setopt(pycurl.WRITEFUNCTION, downloadedContent.write)
curlHandle.setopt(pycurl.URL, 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')

curlHandle.perform()
content = downloadedContent.getvalue()

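# Write the downloaded bytes back out unchanged, then close the file
# so the data is flushed to disk.
fileHandle = open('unicode-test.txt', 'wb')
fileHandle.write(content)
fileHandle.close()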

The file is written correctly. I can even print the whole content to the console, and all characters show up fine. So what puzzles me is: where does cStringIO fail? Is there any reason why I should not use it?

[Note: I'm using Python 2.6 and need to stick to this version]


Answer 1:


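Any text that uses only ASCII code points (U+0000 through U+007F, stored as byte values 00-7F hexadecimal) can be encoded as ASCII. In practice, text that contains anything beyond unaccented Latin letters, digits, and common punctuation, such as accented letters or non-Latin scripts, is not ASCII.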
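To make the distinction concrete, here is a minimal sketch that checks whether a Unicode string can be encoded as ASCII. The helper name `is_ascii_encodable` and the sample strings are made up for illustration; the code is written to run on both Python 2.6 and Python 3.

```python
def is_ascii_encodable(text):
    """Return True if `text` can be encoded as ASCII (all code points below 128)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


plain = u"hello world"    # only code points in the range 00-7F
accented = u"caf\u00e9"   # U+00E9 (e with acute accent) lies outside ASCII

results = (is_ascii_encodable(plain), is_ascii_encodable(accented))
```

Here `results` comes out as `(True, False)`: the first string is the kind of Unicode string the cStringIO docs say is accepted, the second is the kind they say is rejected.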

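In your example code, you are not converting the input to Unicode text; you are treating it as undecoded bytes. The test page in question is encoded in UTF-8, and you never decode that to Unicode. The bytes go into the cStringIO object and come back out unchanged, which is why everything appears to work.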
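The difference between undecoded bytes and Unicode text can be sketched as follows. The byte string is a made-up stand-in for what pycurl would pass to the write callback, not the content of the actual test page.

```python
# UTF-8 encoded bytes, as they would arrive from the network.
# They spell "cafe" with an acute accent on the e, a space,
# a euro sign, and the digit 5.
raw = b"caf\xc3\xa9 \xe2\x82\xac5"

# Decoding turns the bytes into Unicode text.
text = raw.decode("utf-8")

byte_count = len(raw)    # number of bytes
char_count = len(text)   # number of characters

# Encoding the text again reproduces the original bytes exactly.
round_trips = text.encode("utf-8") == raw
```

The byte string is 10 bytes long while the decoded text has 7 characters, because the accented letter takes two bytes in UTF-8 and the euro sign takes three. Writing `raw` to a file without ever decoding it just copies those bytes through.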

If you were to decode the value to a Unicode string, you would not be able to store that string in a cStringIO object unless it contained only ASCII characters.
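If decoded text does need to be buffered, one possible approach is sketched below. It assumes the `io` module (available from Python 2.6 onward) is acceptable as a stand-in for `cStringIO`; this is an illustration and not something the original answer prescribes. The sample string is made up.

```python
import io

text = u"caf\u00e9"   # Unicode text containing a non-ASCII character

# Option 1: keep a byte buffer and encode explicitly before writing.
byte_buffer = io.BytesIO()
byte_buffer.write(text.encode("utf-8"))
stored_bytes = byte_buffer.getvalue()

# Option 2: use a text buffer, which accepts Unicode strings directly.
text_buffer = io.StringIO()
text_buffer.write(text)
stored_text = text_buffer.getvalue()
```

Either way the encoding step is explicit: the byte buffer holds UTF-8 bytes that must be decoded on the way out, while the text buffer holds Unicode text throughout.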

You may want to read up on the difference between Unicode and text encodings such as ASCII and UTF-8. I can recommend:

  • Joel Spolsky's minimum Unicode article
  • The Python Unicode HOWTO.


Source: https://stackoverflow.com/questions/12801166/python2-6-cstringio-unicode-support
