Question
I am looking to read a C++ std::string, pass that std::string to a function, and have that function analyse it and extract the Unicode symbols & simple ASCII symbols from it.
I searched many tutorials online, but all of them mentioned that standard C++ does not fully support Unicode. Many of them recommended using ICU C++ instead.
This is my C++ program for understanding the very basics of the above functionality. It reads a raw string, converts it to an ICU UnicodeString, and prints it:
#include <iostream>
#include <string>
#include "unicode/unistr.h"

int main()
{
    std::string s = "Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}
Expected Output:
Hello☺
Actual Output:
Hello?
Please suggest what I am doing wrong. Also suggest any alternative/simpler approaches.
Thanks
Update 1 (Older): The working code is as follows:
#include <iostream>
#include <string>
#include <clocale>
#include <locale>
#include "unicode/unistr.h"

void f(const std::string & s)
{
    std::wcout << "Inside called function" << std::endl;

    constexpr char locale_name[] = "";
    setlocale(LC_ALL, locale_name);
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}

int main()
{
    constexpr char locale_name[] = "";
    setlocale(LC_ALL, locale_name);
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "Inside main function" << std::endl;

    std::string s = u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
        ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;

    f(s);
    return 0;
}
Now, both the expected output & the actual output are the same, i.e.:
Inside main function
hello☺
--------------------------------
Inside called function
hello☺
Update 2 (Latest): The code mentioned in Update 1 does not work for symbols outside the Basic Multilingual Plane, such as 😆, which take two UTF-16 code units. So, the working code for all possible Unicode symbols is as follows. Special thanks to @Botje for his solution. I wish I could give more than one tick to his solution!!! :)
#include <iostream>
#include <string>
#include <clocale>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"

void f(const std::u32string & s)
{
    std::wcout << "INSIDE CALLED FUNCTION:" << std::endl;

    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;

    // char32At takes a UTF-16 code-unit offset, so advance the offset with
    // moveIndex32 to step exactly one code point (surrogate pair) at a time
    for (int32_t i = 0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;

    std::cout << "--------------------------------" << std::endl;
}

int main()
{
    std::cout << "--------------------------------" << std::endl;

    constexpr char locale_name[] = "";
    setlocale(LC_ALL, locale_name);
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "INSIDE MAIN FUNCTION:" << std::endl;

    std::u32string s = U"hello☺😆";
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;

    for (int32_t i = 0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;

    std::cout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}
Now, both the expected output & the actual output are the same, i.e.:
--------------------------------
INSIDE MAIN FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
INSIDE CALLED FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
Answer 1:
There are a number of stumbling blocks to get this right:
- First, your file (and the smiley face in it) should be encoded as UTF-8. The smiley face should consist of the literal bytes 0xE2 0x98 0xBA.
- You should mark the string as containing UTF-8 data using the u8 prefix: u8"Hello☺". (A byte-dump sketch after this list shows one way to verify both points.)
- Next, the documentation of icu::UnicodeString remarks that it stores Unicode as UTF-16. In this case you are lucky, as U+263A fits in one UTF-16 code unit. Other emoji might not! You should either convert the string to UTF-32, or be very careful and use the char32At function.
- Finally, the encoding used by wcout should be configured with imbue to match the encoding expected by your environment. See the answers to this question. (A conversion sketch after the list touches on the last two points.)
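To verify the first two points, it can help to dump the raw bytes of the string. A minimal sketch, assuming the file is compiled as C++11/14/17 (under C++20 a u8 literal produces char8_t and would need a conversion):

#include <cstdio>
#include <string>

int main()
{
    // the u8 prefix guarantees the literal is stored as UTF-8,
    // regardless of the compiler's execution character set
    std::string s = u8"Hello☺";
    for (unsigned char c : s)
        std::printf("%02X ", c);   // expect: 48 65 6C 6C 6F E2 98 BA
    std::printf("\n");
    return 0;
}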
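For the last two points, here is a sketch of a conversion that stays code-point-correct, assuming a platform where wchar_t is 32 bits wide (e.g. Linux/glibc) so that one wchar_t holds any code point; on Windows, where wchar_t is 16 bits, the wstring step would need UTF-16 instead:

#include <iostream>
#include <locale>
#include <string>
#include "unicode/unistr.h"

int main()
{
    // match wcout's encoding to the environment, as suggested above
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    std::string s = u8"Hello☺😆";
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // extract whole code points into a UTF-32 buffer
    UErrorCode status = U_ZERO_ERROR;
    std::u32string u32(ustr.countChar32(), U'\0');
    ustr.toUTF32(reinterpret_cast<UChar32 *>(&u32[0]), static_cast<int32_t>(u32.size()), status);

    if (U_SUCCESS(status)) {
        std::wstring ws(u32.begin(), u32.end());   // one code point per wchar_t
        std::wcout << ws << std::endl;
    }
    return 0;
}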
Source: https://stackoverflow.com/questions/60092291/unable-to-extract-unicode-symbols-from-c-stdstring