How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?

星月不相逢 2020-12-19 06:06

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text

3 Answers
  • 2020-12-19 06:34

    Try calling PyErr_Print() in the "if (!py_string)" clause. The Python exception it prints may give you more information about what went wrong.

  • 2020-12-19 06:50

    PyString_Decode does this:

    PyObject *PyString_Decode(const char *s,
                  Py_ssize_t size,
                  const char *encoding,
                  const char *errors)
    {
        PyObject *v, *str;
    
        str = PyString_FromStringAndSize(s, size);
        if (str == NULL)
            return NULL;
        v = PyString_AsDecodedString(str, encoding, errors);
        Py_DECREF(str);
        return v;
    }
    

    In other words, it does basically what your second example does: it converts the bytes to a string object, then decodes that string. The problem here arises from PyString_AsDecodedString, as opposed to PyString_AsDecodedObject. PyString_AsDecodedString calls PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object using the default encoding (for you, that looks like ASCII). That's where it fails.
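
    To make the failure concrete, here is a minimal plain-C sketch of the two steps described above. It does not call the Python API at all: decode_cp1252_byte, encode_ascii and decode_then_reencode are hypothetical helpers written for illustration, and the windows-1252 table is deliberately abbreviated. The point is only that byte 0x93 decodes to U+201C, which has no ASCII representation, so the second step cannot succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map one windows-1252 byte to a Unicode code point.
   Only a few entries of the 0x80-0x9F range are listed; bytes in that
   range that this abbreviated table does not cover map to U+FFFD. */
unsigned decode_cp1252_byte(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0)
        return b;                 /* same as ASCII / Latin-1 */
    switch (b) {
    case 0x80: return 0x20AC;     /* EURO SIGN */
    case 0x91: return 0x2018;     /* LEFT SINGLE QUOTATION MARK */
    case 0x92: return 0x2019;     /* RIGHT SINGLE QUOTATION MARK */
    case 0x93: return 0x201C;     /* LEFT DOUBLE QUOTATION MARK */
    case 0x94: return 0x201D;     /* RIGHT DOUBLE QUOTATION MARK */
    default:   return 0xFFFD;     /* not covered by this sketch */
    }
}

/* Hypothetical helper: strict ASCII encoding of one code point.
   Returns 0 on success, -1 if the code point is outside ASCII. */
int encode_ascii(unsigned cp, char *out)
{
    assert(out != NULL);
    if (cp > 0x7F)
        return -1;
    *out = (char)cp;
    return 0;
}

/* Decode, then re-encode with ASCII: the two-step shape described above. */
int decode_then_reencode(unsigned char b, char *out)
{
    return encode_ascii(decode_cp1252_byte(b), out);
}
```

    With these helpers, decode_then_reencode(0x93, &c) returns -1, while a plain ASCII byte such as 'A' round-trips unchanged. Stopping after the decode step and keeping the unicode object avoids the failing re-encode.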

    I believe you'll need to do two calls - but you can use PyString_AsDecodedObject rather than calling the python "decode" method. Something like:

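    #include <Python.h>
    #include <stdio.h>
    
    int main(int argc, char *argv[])
    {
         /* 0x93 is a left double quotation mark in windows-1252 */
         char c_string[] = { (char)0x93, 0 };
         PyObject *py_string, *py_unicode;
    
         Py_Initialize();
    
         /* Step 1: wrap the raw bytes in a Python 2 str object */
         py_string = PyString_FromStringAndSize(c_string, 1);
         if (!py_string) {
              PyErr_Print();
              return 1;
         }
         /* Step 2: decode to a unicode object, with no re-encoding afterwards */
         py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
         Py_DECREF(py_string);
         if (!py_unicode) {
              PyErr_Print();
              return 1;
         }
    
         /* ... use py_unicode here, then release it ... */
         Py_DECREF(py_unicode);
         Py_Finalize();
         return 0;
    }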
    

    I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.

  • 2020-12-19 06:52

    You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

    Just use PyString_FromString:

    /* Example data only: cstring must point to NUL-terminated bytes */
    const char *cstring = "caf\xe9";
    PyObject *pystring = PyString_FromString(cstring);
    

    That's all. Now you have a Python str() object. See docs here: https://docs.python.org/2/c-api/string.html

    I'm a little confused about whether you want a "str" or a "unicode" object. They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.
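
    One caveat when treating the data as raw bytes: PyString_FromString expects a NUL-terminated C string, so bytes read from a file that contain an embedded NUL would be cut short, and PyString_FromStringAndSize with an explicit length is the safer call in that case. A minimal plain-C sketch of that check follows; has_embedded_nul is a hypothetical helper, not part of the Python API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first `size` bytes of `data`
   contain a NUL byte, 0 otherwise. If it returns 1, any API that relies
   on NUL termination would see only the bytes before that NUL. */
int has_embedded_nul(const char *data, size_t size)
{
    assert(data != NULL);
    return memchr(data, '\0', size) != NULL;
}
```

    If has_embedded_nul(buffer, n) returns 1 for the n bytes you read, pass the buffer and n to PyString_FromStringAndSize rather than relying on the terminator.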
