Cross-platform iteration of Unicode string (counting Graphemes using ICU)

南旧 2020-12-02 19:20

I want to iterate each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).

3 answers
  • ICU's interface is quite dated; Boost.Locale is much nicer to work with:

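    #include <iostream>
    #include <string_view>

    #include <boost/locale.hpp>

    using namespace std::string_view_literals;

    int main()
    {
        boost::locale::generator gen;
        auto string = "noël"sv;
        // The source snippet breaks off inside the literal above; everything from
        // here on is an assumed completion (needs Boost.Locale's ICU backend).
        namespace ba = boost::locale::boundary;
        ba::segment_index<std::string_view::const_iterator> map(
            ba::character, string.begin(), string.end(), gen("en_US.UTF-8"));
        for (const auto& grapheme : map)
            std::cout << grapheme << '\n'; // one grapheme per line
    }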
  • 2020-12-02 20:01

    Glib's ustring class gives you UTF-8 strings, if using UTF-8 is OK for you. It is designed to be similar to std::string. Since UTF-8 is the native encoding on Linux, your task is quite easy:

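    #include <iostream>
    #include <glibmm/ustring.h>

    int main()
    {
        // A narrow literal: Glib::ustring expects UTF-8, not a wide (L"...") string.
        Glib::ustring s = "नमस्ते";
        std::cout << s.size() << '\n'; // number of Unicode code points, not bytes
    }
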

    You can also iterate over the string's characters as usual with Glib::ustring::iterator.
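
One caveat worth checking: as far as I know, Glib::ustring::size() and its iterator work in Unicode code points, so a combining character sequence still counts as several characters rather than one grapheme. A minimal standard-library sketch of what that number means (the helper name count_code_points is made up for illustration, and the code assumes valid UTF-8 input):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is valid UTF-8; this does NOT count graphemes.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n; // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For "noël" written as e followed by U+0308 (combining diaeresis), this gives 5 code points, although a reader sees 4 characters.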

  • 2020-12-02 20:02

    You should be able to use ICU's BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).
