Convert ISO-8859-1 strings to UTF-8 in C/C++

礼貌的吻别 2020-12-05 05:27

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8.

6 Answers
  • 2020-12-05 05:30

    You can use the boost::locale library:

    http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

    The code would look like this:

    #include <boost/locale.hpp>
    #include <string>

    // to_utf lives in the boost::locale::conv namespace
    std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1");
    
  • 2020-12-05 05:31

    The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

    It would not be difficult to parse that table directly and form a lookup table from it at compile time.
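A minimal sketch of that lookup-table idea, assuming C++11. The names `kCp1252High`, `append_utf8` and `cp1252_to_utf8` are hypothetical, and the table is hand-transcribed rather than parsed from the mapping file. Only bytes 0x80 to 0x9F differ from ISO-8859-1, so only those 32 entries need a table. Mapping the five undefined bytes to U+FFFD is an assumption of this sketch, not something the table prescribes.

```cpp
#include <cstdint>
#include <string>

// Code points for CP1252 bytes 0x80..0x9F. The five bytes the mapping
// leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to U+FFFD
// here; that choice is an assumption of this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Append one code point as UTF-8. Handles the BMP only (up to three
// bytes), which is enough for every code point CP1252 can produce.
static void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

std::string cp1252_to_utf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 3);  // worst case: three UTF-8 bytes per input byte
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(in[i]);
        // Outside 0x80..0x9F, the byte value is already the code point.
        const std::uint32_t cp =
            (ch >= 0x80 && ch <= 0x9F) ? kCp1252High[ch - 0x80] : ch;
        append_utf8(out, cp);
    }
    return out;
}
```

For example, the CP1252 euro sign (byte 0x80) comes out as the three bytes E2 82 AC, while 0xE9 (é) comes out as C3 A9, exactly as it would from ISO-8859-1.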

  • 2020-12-05 05:36

    The C++03 standard does not provide functions to directly convert between specific charsets.

    Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() & Co. on Windows. The open-source ICU library offers extensive support for string conversion.
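For the iconv() route, a rough sketch might look like the following. It assumes a glibc-style declaration where the input pointer is `char **` (some platforms declare it `const char **`), and the wrapper name `latin1_to_utf8_iconv` is hypothetical. On systems where iconv is not part of libc you may need to link with `-liconv`.

```cpp
#include <iconv.h>    // POSIX; part of glibc on Linux
#include <cstddef>
#include <stdexcept>
#include <string>

std::string latin1_to_utf8_iconv(const std::string &in)
{
    // Note the argument order: target encoding first, source second.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each ISO-8859-1 byte expands to at most two UTF-8 bytes.
    std::string out(in.size() * 2, '\0');

    // iconv() does not modify the input bytes; the cast only satisfies
    // the (assumed) char** signature.
    char *src = const_cast<char *>(in.data());
    std::size_t src_left = in.size();
    char *dst = &out[0];
    std::size_t dst_left = out.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dst_left);  // drop the unused tail
    return out;
}
```

Opening a descriptor per call keeps the sketch short; code that converts many strings would more likely open it once and reuse it.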

  • 2020-12-05 05:37

    For C++, I use this:

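    #include <cstdint>
    #include <string>

    // Each ISO-8859-1 byte is the Unicode code point of the same value,
    // so bytes >= 0x80 become a two-byte UTF-8 sequence.
    std::string iso_8859_1_to_utf8(const std::string &str)
    {
        std::string strOut;
        strOut.reserve(str.size() * 2);
        for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
        {
            std::uint8_t ch = static_cast<std::uint8_t>(*it);
            if (ch < 0x80) {
                strOut.push_back(static_cast<char>(ch));
            }
            else {
                strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));    // lead byte: 0xC2 or 0xC3
                strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));  // continuation byte
            }
        }
        return strOut;
    }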
    
  • 2020-12-05 05:37

    ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.

    The C++ aspects -- integrating that with iostreams -- are much harder.

    I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

    Cheers & hth.,

  • 2020-12-05 05:39

    If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

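    /* in: NUL-terminated ISO-8859-1 input; out: caller-provided output buffer */
    void latin1_to_utf8(const unsigned char *in, unsigned char *out)
    {
        while (*in)
            if (*in<128) *out++=*in++;
            else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;
        *out = 0;  /* terminate the output string */
    }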
    

    For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
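The size-limit variant mentioned above could be sketched like this. The function name `latin1_to_utf8_bounded` and the convention of returning `(size_t)-1` when the buffer is too small are assumptions made for illustration.

```cpp
#include <cstddef>

// Converts NUL-terminated ISO-8859-1 text into out, never writing more
// than out_size bytes (including the terminating NUL). Returns the number
// of bytes written, not counting the NUL, or (size_t)-1 if the output
// buffer is too small.
std::size_t latin1_to_utf8_bounded(const unsigned char *in,
                                   unsigned char *out, std::size_t out_size)
{
    const std::size_t kTooSmall = static_cast<std::size_t>(-1);
    if (out_size == 0)
        return kTooSmall;  // no room even for the NUL

    std::size_t n = 0;
    for (; *in; ++in) {
        const std::size_t need = (*in < 0x80) ? 1 : 2;
        if (n + need + 1 > out_size)  // always keep one byte for the NUL
            return kTooSmall;
        if (*in < 0x80) {
            out[n++] = *in;
        } else {
            out[n++] = static_cast<unsigned char>(0xC2 + (*in > 0xBF));
            out[n++] = static_cast<unsigned char>(0x80 + (*in & 0x3F));
        }
    }
    out[n] = 0;
    return n;
}
```

With this shape the caller can either size the buffer at twice the input length plus one, or pass a smaller buffer and handle the failure return.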
