Transcoding characters on-the-fly using iostreams and ICU

Submitted by 旧巷老猫 on 2020-01-06 08:36:13

Question


I'd like to transcode between character encodings on the fly, using iostreams and my own transcoding streambuf, e.g.:

xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );

char *utf8_s;    // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s;  // characters are written in ISO-8859-1

The implementation of xcoder_streambuf would use ICU's converter API. It would take the incoming data (in this case, from utf8_s), transcode it, and write it out through the iostream's original streambuf.

Is that a reasonable way to go? If not, what would be better?


Answer 1:


Is that a reasonable way to go?

Yes, but it is not the way you are expected to do it in modern (as in 1997) iostream.

Output through basic_streambuf<> is defined by the virtual function overflow(int_type c).
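As an illustration of that hook, here is a minimal sketch of a filtering streambuf that receives every character in overflow() and forwards transcoded bytes to another streambuf. The names utf8_to_latin1_buf and transcode_via_stream are hypothetical, and the hand-written UTF-8 decoder is only a stand-in for the ICU converter the question has in mind: it does not reject overlong encodings, and it replaces anything outside ISO-8859-1 with '?'.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for the question's xcoder_streambuf:
// consumes UTF-8 bytes and forwards ISO-8859-1 bytes to `sink`.
class utf8_to_latin1_buf : public std::streambuf {
public:
    explicit utf8_to_latin1_buf(std::streambuf* sink) : sink_(sink) {
        assert(sink_ != nullptr);
    }

protected:
    // No put area is installed, so every character written reaches overflow().
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);

        const unsigned char byte =
            static_cast<unsigned char>(traits_type::to_char_type(c));

        if (pending_ == 0) {
            if (byte < 0x80) return emit(byte);  // ASCII passes through
            if ((byte & 0xE0) == 0xC0)      { code_point_ = byte & 0x1F; pending_ = 1; }
            else if ((byte & 0xF0) == 0xE0) { code_point_ = byte & 0x0F; pending_ = 2; }
            else if ((byte & 0xF8) == 0xF0) { code_point_ = byte & 0x07; pending_ = 3; }
            else return emit('?');  // stray continuation or invalid lead byte
            return c;
        }

        if ((byte & 0xC0) != 0x80) {  // multi-byte sequence cut short
            pending_ = 0;
            if (traits_type::eq_int_type(emit('?'), traits_type::eof()))
                return traits_type::eof();
            return overflow(c);  // reinterpret this byte as a new lead byte
        }

        code_point_ = (code_point_ << 6) | (byte & 0x3F);
        if (--pending_ > 0) return c;  // still waiting for continuation bytes
        return emit(code_point_ <= 0xFF ? code_point_ : U'?');
    }

    int sync() override { return sink_->pubsync(); }

private:
    // Write one ISO-8859-1 byte to the wrapped streambuf.
    int_type emit(char32_t cp) {
        return sink_->sputc(static_cast<char>(static_cast<unsigned char>(cp)));
    }

    std::streambuf* sink_;
    char32_t code_point_ = 0;
    int pending_ = 0;  // continuation bytes still expected
};

// Hypothetical helper: push UTF-8 text through the filter and collect the result.
std::string transcode_via_stream(const std::string& utf8) {
    std::ostringstream sink;
    utf8_to_latin1_buf xbuf(sink.rdbuf());
    std::ostream out(&xbuf);
    out << utf8 << std::flush;
    return sink.str();
}
```

Installing such a buffer with cout.rdbuf(&xbuf), as in the question, works the same way; the original streambuf should be restored before xbuf goes out of scope.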

The description of basic_filebuf<>::overflow(int_type c = traits::eof()) includes the call a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:

const codecvt<charT,char,typename traits::state_type>& a_codecvt 
     = use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());

So you are expected to imbue the stream with a locale that carries the appropriate codecvt<charT,char,typename traits::state_type> facet.

The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.
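To make the locale route concrete, here is a minimal sketch of a user-defined facet imbued into a wide file stream. The facet name latin1_codecvt, the helper write_as_latin1, and the file path passed to it are all hypothetical. It assumes the wchar_t values in the stream are Unicode code points and maps each one below 256 to a single ISO-8859-1 byte. This is not what the question's char-based cout does, since it requires wide streams.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: internal wchar_t code points <-> external ISO-8859-1 bytes.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to) {
            if (static_cast<unsigned long>(*from) > 0xFF) {  // not in ISO-8859-1
                from_next = from;
                to_next = to;
                return error;
            }
            *to = static_cast<char>(static_cast<unsigned char>(*from));
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to)
            *to = static_cast<wchar_t>(static_cast<unsigned char>(*from));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    // The encoding is stateless, so there is never a shift sequence to write.
    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }  // fixed width: 1 byte
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }

    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const std::size_t avail = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(avail < max ? avail : max);
    }
};

// Hypothetical helper: write `text` through the facet, then read the raw bytes back.
std::string write_as_latin1(const std::wstring& text, const char* path) {
    {
        std::wofstream out;
        // Imbue before opening so the filebuf picks up the facet from the start.
        out.imbue(std::locale(std::locale::classic(), new latin1_codecvt));
        out.open(path, std::ios::binary);
        out << text;
    }  // closing the stream flushes the converted bytes
    std::ifstream in(path, std::ios::binary);
    assert(in.is_open());
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The locale takes ownership of the facet created with new, so no explicit delete is needed.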

The standard library's support for Unicode has made some progress since 1997:

the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.

This seems to be what you want (ISO-8859-1 code values coincide with the first 256 UCS-4 code points, and UCS-4 = UTF-32).
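A minimal sketch of that idea, assuming a C++11 or later standard library: decode UTF-8 to UTF-32 with the standard facet, then narrow each code point that fits in one byte. The function name utf8_to_latin1 is hypothetical. Note that C++20 deprecates this particular specialization in favour of the char8_t one, so some compilers may warn.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-32 via the standard facet, then UTF-32 -> ISO-8859-1.
std::string utf8_to_latin1(const std::string& utf8) {
    if (utf8.empty()) return std::string();

    using cvt_t = std::codecvt<char32_t, char, std::mbstate_t>;
    const cvt_t& cvt = std::use_facet<cvt_t>(std::locale::classic());

    // A UTF-8 sequence never yields more code points than it has bytes.
    std::u32string wide(utf8.size(), U'\0');
    std::mbstate_t state{};
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;

    const std::codecvt_base::result res =
        cvt.in(state, utf8.data(), utf8.data() + utf8.size(), from_next,
               &wide[0], &wide[0] + wide.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8");
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));

    std::string latin1;
    latin1.reserve(wide.size());
    for (char32_t cp : wide) {
        // Code points 0..255 are exactly the ISO-8859-1 code values.
        if (cp > 0xFF)
            throw std::range_error("code point not representable in ISO-8859-1");
        latin1.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return latin1;
}
```

For example, the UTF-8 bytes C3 A9 (é) decode to U+00E9, which is written out as the single byte E9.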

If not, what would be better?

I would introduce a different type for UTF-8, like:

struct utf8 {
    unsigned char d; // d for data
};

struct latin1 {
    unsigned char c; // c for character 
};

This way you cannot accidentally pass UTF-8 where ISO-8859-* is expected. But then you would have to write some interface code, and your streams would no longer be of type istream/ostream.

Disclaimer: I have never actually done this, so I don't know whether it is workable in practice.
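For what it is worth, the interface code for the distinct-type idea might look roughly like the sketch below. The aliases utf8_text and latin1_text and the function to_latin1 are hypothetical names introduced here. The conversion handles only ASCII and the two-byte sequences that land in U+0080..U+00FF, so it is an illustration rather than production code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct utf8 {
    unsigned char d; // d for data
};

struct latin1 {
    unsigned char c; // c for character
};

// Hypothetical aliases: sequences of the two distinct code-unit types.
using utf8_text = std::vector<utf8>;
using latin1_text = std::vector<latin1>;

// Hypothetical interface function: the only route from utf8 to latin1.
// Passing a latin1_text here, or a utf8_text where latin1_text is
// expected, fails to compile.
latin1_text to_latin1(const utf8_text& in) {
    latin1_text out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char b = in[i].d;
        if (b < 0x80) {  // ASCII is identical in both encodings
            out.push_back(latin1{b});
            continue;
        }
        // Only lead bytes C2 and C3 encode U+0080..U+00FF.
        if ((b == 0xC2 || b == 0xC3) && i + 1 < in.size() &&
            (in[i + 1].d & 0xC0) == 0x80) {
            const unsigned char cp = static_cast<unsigned char>(
                ((b & 0x03) << 6) | (in[i + 1].d & 0x3F));
            out.push_back(latin1{cp});
            ++i;  // skip the continuation byte just consumed
            continue;
        }
        throw std::range_error("not representable in ISO-8859-1");
    }
    return out;
}
```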



Source: https://stackoverflow.com/questions/8453546/transcoding-characters-on-the-fly-using-iostreams-and-icu
