How to detect “â€‹” (a combination of Unicode characters) in a C++ string


I am trying to detect certain combinations of Unicode characters (like â€‹) in order to clean up the string. For a single Unicode character the detection works, but for a combination of Unicod


1 answer

没有蜡笔的小新

2021-01-14 20:10

OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit this:

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

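// Returns " " if s is empty or contains any character outside the listed set
// (whitespace plus U+00A0, Â, â, € and ‹); otherwise returns s unchanged.
std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}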

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\t\"" << detect_Unicode (u8" â€‹    â€‹ ") << "\"\n";
    std::cout << "2.\t\"" << detect_Unicode (u8"are Â Â there is something Â Â Â â€‹ combination    â€‹") << "\"\n";
    std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4.\t\"" << detect_Unicode (u8"â€‹  Â Â â€‹") << "\"\n";
    std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}

Output:

  Â â € ‹

0.  " "
1.  " â€‹    â€‹ "
2.  " "
3.  " Â Â "
4.  "â€‹  Â Â â€‹"
5.  "Â Â â â"

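Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed: it returns " " as soon as the string contains any character outside the listed set, which is why cases 0 and 2 collapse to a single space. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because each element is now a whole character rather than one byte of a multibyte sequence.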
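To make the multibyte issue concrete, here is a minimal sketch. The helper count_code_points and the constant mojibake are names made up for illustration, and the string is written as explicit UTF-8 bytes so the source file encoding doesn't matter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points (const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// "â€‹" as UTF-8: â = C3 A2, € = E2 82 AC, ‹ = E2 80 B9.
const std::string mojibake = "\xC3\xA2\xE2\x82\xAC\xE2\x80\xB9";
```

Here mojibake.size() is 8 while count_code_points (mojibake) is 3, so character-set functions on the narrow string work on fragments of characters. For instance, std::string ("\xC3\x82").find_first_of ("\xC3\xA2") reports a match at position 0 even though Â and â are different characters, because they share the lead byte C3.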

An alternative, slightly radical, implementation of detect_Unicode() might be:

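std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);
    for (auto wide_char : ws)
    {
        if (wide_char > 0xff)    // anything above U+00FF, i.e. outside Latin-1
            return " ";
    }
    return s;
}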

But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.
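As one example of what the wide string allows, here is a minimal sketch that removes an exact sequence of characters rather than testing for individual ones. The name erase_all is made up for illustration, and whether erasing is the cleanup the OP wants is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical cleanup step: removes every occurrence of the exact
// sequence `seq` from `ws`, leaving all other characters untouched.
std::wstring erase_all (std::wstring ws, const std::wstring& seq)
{
    if (seq.empty())
        return ws;                       // nothing to search for
    for (std::size_t pos = ws.find (seq);
         pos != std::wstring::npos;
         pos = ws.find (seq, pos))       // resume at the erase position
    {
        ws.erase (pos, seq.size());
    }
    return ws;
}
```

With the narrow and widen functions above, this could be called as narrow (erase_all (widen (s), L"\u00E2\u20AC\u2039")). Matching the whole sequence means a legitimate lone â (as in "château") survives.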

Other notes:

std::wstring_convert and the <codecvt> header (including std::codecvt_utf8) are deprecated in C++17, but since the standard library offers no obvious replacement you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
Depending on platform, std::wstring might not be the best choice (wchar_t is only 16 bits on Windows, for example), but it's probably fine. You could also look at std::u16string and std::u32string.
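If the deprecation is a concern, widen can be replaced by a hand-written decoder. The following is a minimal sketch, not a validator: widen_utf32 is a made-up name, it targets std::u32string, it maps malformed or truncated sequences to U+FFFD, and it does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical replacement for widen(): decodes UTF-8 into UTF-32 by hand.
std::u32string widen_utf32 (const std::string& s)
{
    const char32_t replacement = 0xFFFD;      // emitted for malformed input
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char> (s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;                // continuation bytes expected
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back (replacement); ++i; continue; }  // stray byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            // Each continuation byte must exist and look like 10xxxxxx.
            if (i >= s.size() || (static_cast<unsigned char> (s[i]) & 0xC0) != 0x80)
            {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char> (s[i]) & 0x3F);
            ++i;
        }
        out.push_back (ok ? cp : replacement);
    }
    return out;
}
```

Because every char32_t holds a full code point on every platform, find_first_not_of and friends behave the same way on the result as they do on the std::wstring above.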

