How to use Unicode range in C++ regex

醉梦人生 2020-12-01 21:58

I have to use a Unicode range in a regex in C++. Basically, I need a regex that accepts all valid Unicode characters. I just tried with the test expression and fac

1 answer
  • 2020-12-01 22:58

    This should work fine, but you need to use std::wregex and std::wsmatch. To make it work, convert both the source string and the regular expression to wide-character Unicode. Because wchar_t is 32 bits on Linux and 16 bits on Windows, that means UTF-32 on Linux and UTF-16(ish) on Windows.

    This works for me where source text is UTF-8:

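    #include <iostream>
    #include <regex>
    #include <string>

    // Placeholder: the body comes from a UTF conversion library (see the note below).
    inline std::wstring from_utf8(const std::string& utf8)
    {
        // code to convert from utf8 to utf32/utf16
    }

    // Placeholder: the inverse conversion, also supplied by the library.
    inline std::string to_utf8(const std::wstring& ws)
    {
        // code to convert from utf32/utf16 to utf8
    }

    int main()
    {
        std::string test = "john.doe@神谕.com"; // utf8
        // One or more characters from U+0080 to U+DB7F. The backslashes are
        // doubled so that the regex engine, not the compiler, sees the escapes.
        std::string expr = "[\\u0080-\\uDB7F]+"; // utf8

        // Widen both the subject text and the pattern before matching.
        std::wstring wtest = from_utf8(test);
        std::wstring wexpr = from_utf8(expr);

        std::wregex we(wexpr);
        std::wsmatch wm;
        if(std::regex_search(wtest, wm, we))
        {
            // Convert the match back to UTF-8 for printing.
            std::cout << to_utf8(wm.str(0)) << '\n';
        }
    }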
    

    Output:

    神谕
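
The matching step can also be tried without any conversion code by writing the pattern and the test string as wide literals. The following is only a minimal sketch: `first_non_ascii_run` is a hypothetical helper name, not something from the answer, and it assumes the standard library's default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original answer): returns the first
// run of characters in the range U+0080..U+DB7F, or an empty string if the
// text contains none.
inline std::wstring first_non_ascii_run(const std::wstring& text)
{
    // Same pattern as in the answer, written as a wide literal. The doubled
    // backslashes leave the four-digit escapes for the regex engine to parse.
    static const std::wregex re(L"[\\u0080-\\uDB7F]+");
    std::wsmatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::wstring();
}
```

Called as `first_non_ascii_run(L"john.doe@\u795E\u8C15.com")`, this should return the two characters 神谕. Since this escape form takes exactly four hex digits, a range written this way cannot name code points above U+FFFF.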
    

    Note: If you need a UTF conversion library I used THIS ONE in the example above.
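
If pulling in a library is not an option, the two conversion functions could be hand-written along the following lines. This is a sketch under assumptions, not the library code the answer used: it produces UTF-32 where wchar_t is at least 32 bits and UTF-16 surrogate pairs otherwise, and it rejects malformed lead and continuation bytes but does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch of a UTF-8 decoder; stands in for the library conversion.
inline std::wstring from_utf8(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (sizeof(wchar_t) >= 4 || cp < 0x10000)
        {
            out.push_back(static_cast<wchar_t>(cp)); // one unit per code point
        }
        else
        {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Sketch of the inverse conversion, wide string back to UTF-8.
inline std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size())
        {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For untrusted input, a maintained conversion library is probably the safer choice, since it would also handle the validation cases this sketch skips.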

    Edit: Or, you could use the functions given in this answer:

    Any good solutions for C++ string code point and code unit?
