How to detect “â€‹” (a combination of Unicode characters) in a C++ string

走了就别回头了 2021-01-14 19:28

I am trying to detect certain combinations of Unicode characters (like â€‹) so that I can clean up a string. Detection works for a single Unicode character, but combination of Unicod

1 answer
  • 2021-01-14 20:10

    OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

    On that basis, I humbly submit this:

    #include <string>
    #include <codecvt>
    #include <locale>
    
    std::string narrow (const std::wstring& ws)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.to_bytes (ws);
    }
    
    std::wstring widen (const std::string& s)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.from_bytes (s);
    }
    
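    // Returns " " if s is empty or contains any character outside the listed
    // set (ASCII whitespace, NBSP, Â, â, €, ‹); otherwise returns s unchanged.
    // Note that ordinary text such as "abcde" is therefore replaced as well.
    std::string detect_Unicode (const std::string& s)
    { 
        std::wstring ws = widen (s);
        if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
            return " ";
        return s;
    }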
    
    #include <iostream>
    
    int main ()
    {
        std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
        std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
        std::cout << "1.\t\"" << detect_Unicode (u8" â€‹    â€‹ ") << "\"\n";
        std::cout << "2.\t\"" << detect_Unicode (u8"are   there is something    â€‹ combination    â€‹") << "\"\n";
        std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
        std::cout << "4.\t\"" << detect_Unicode (u8"â€‹    â€‹") << "\"\n";
        std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
    }
    

    Output:

      Â â € ‹
    
    0.  " "
    1.  " â€‹    â€‹ "
    2.  " "
    3.  " Â Â "
    4.  "â€‹    â€‹"
    5.  "Â Â â â"
    

    Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.
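
To illustrate the point about ordinary basic_string operations, here is a minimal sketch that removes a whole sequence such as â€‹ from a UTF-8 string. remove_sequence is a hypothetical helper name, not something from the question, and the sketch assumes the input is valid UTF-8 (from_bytes throws std::range_error otherwise).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Same conversion helpers as in the answer (deprecated since C++17).
std::wstring widen(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    return convert.from_bytes(s);
}

std::string narrow(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    return convert.to_bytes(ws);
}

// Hypothetical helper: erase every occurrence of `needle` from the UTF-8
// text `s` in a single left-to-right pass, working on whole characters.
std::string remove_sequence(const std::string& s, const std::wstring& needle)
{
    if (needle.empty())
        return s;
    std::wstring ws = widen(s);
    std::wstring::size_type pos = 0;
    while ((pos = ws.find(needle, pos)) != std::wstring::npos)
        ws.erase(pos, needle.size());
    return narrow(ws);
}
```

Called as remove_sequence(input, L"\u00E2\u20AC\u2039"), this strips the three-character run â€‹ wherever it appears and leaves everything else alone.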

    An alternative, slightly radical, implementation of detect_Unicode() might be:

    for (auto wide_char : ws)
    {
        if (wide_char > 0xff)
            return " ";
    }
    return s;
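
For reference, the check in the loop above can be pulled out into a small predicate on the wide string. This is only a sketch and has_non_latin1 is a made-up name. Note what the threshold implies: Â (U+00C2) and â (U+00E2) are not above 0xff, so they are not flagged, whereas € (U+20AC) and ‹ (U+2039) are.

```cpp
#include <algorithm>
#include <string>

// Hypothetical predicate: true if any character in ws lies above U+00FF,
// i.e. outside the Latin-1 range.
bool has_non_latin1(const std::wstring& ws)
{
    return std::any_of(ws.begin(), ws.end(),
                       [](wchar_t c) { return c > 0xff; });
}
```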
    

    But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.

    Other notes:

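    • std::wstring_convert and std::codecvt_utf8 (the whole <codecvt> header) are deprecated as of C++17, but since the standard library offers no obvious replacement you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
    • Depending on the platform, std::wstring might not be the best choice: where wchar_t is 16 bits wide (Windows, for example), std::codecvt_utf8<wchar_t> produces UCS-2 and cannot represent characters outside the Basic Multilingual Plane. For the characters discussed here it is fine. You could also look at std::u16string and std::u32string.
    • The u8"..." literals in main convert to std::string only up to C++17. From C++20 on they have type const char8_t[], so either build as C++17 or drop the u8 prefix and make sure the source file is compiled as UTF-8.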
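
If the deprecation bothers you, narrow and widen can be swapped for hand-written conversions to std::u32string. The following is a rough sketch rather than a hardened decoder: widen_utf32 and narrow_utf8 are hypothetical names, and although malformed or truncated input is rejected, overlong encodings and surrogate code points are not checked.

```cpp
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-32 code points (one char32_t per character).
std::u32string widen_utf32(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::range_error("invalid UTF-8 lead byte");
        if (i + len > s.size())
            throw std::range_error("truncated UTF-8 sequence");
        // Fold in the payload bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::range_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode UTF-32 code points back into UTF-8.
std::string narrow_utf8(const std::u32string& ws)
{
    std::string out;
    for (char32_t cp : ws) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::range_error("code point out of range");
        }
    }
    return out;
}
```

Because char32_t is 32 bits on every platform, this sidesteps the 16-bit wchar_t question entirely. The character set in detect_Unicode would then be written with a U"..." literal instead of L"...".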

