I am trying to detect certain combinations of Unicode characters (like ​) in order to clean up a string. Detection works for a single Unicode character, but for a combination of Unicod
OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).
On that basis, I humbly submit this:
#include <string>
#include <codecvt>
#include <locale>

// Convert a wide string to a UTF-8 encoded narrow string
std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

// Convert a UTF-8 encoded narrow string to a wide string
std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

// Returns " " if the string is empty or contains any character outside
// the listed set; otherwise returns the input unchanged
std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}
#include <iostream>
int main ()
{
std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
std::cout << "1.\t\"" << detect_Unicode (u8" ​ ​ ") << "\"\n";
std::cout << "2.\t\"" << detect_Unicode (u8"are   there is something    ​ combination ​") << "\"\n";
std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
std::cout << "4.\t\"" << detect_Unicode (u8"​   ​") << "\"\n";
std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}
Output:
  Â â € ‹
0. " "
1. " ​ ​ "
2. " "
3. " Â Â "
4. "​   ​"
5. "Â Â â â"
Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably: each element is now a whole character rather than one byte of a multibyte sequence, so a search can no longer match part of a character.
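To make the multibyte issue concrete, here is a minimal sketch. The helper names contains_any_bytes and contains_any_code_points are made up for illustration, and it uses std::u32string literals rather than widen() so that it needs no conversion step; the same reasoning should apply to a std::wstring produced by widen().

```cpp
#include <cassert>
#include <string>

// Byte-level search on UTF-8 data: reports a match if ANY single byte of
// `set` occurs in `utf8`, even when no whole character is shared.
// (Hypothetical helper, for illustration only.)
bool contains_any_bytes (const std::string& utf8, const std::string& set)
{
    return utf8.find_first_of (set) != std::string::npos;
}

// Code-point-level search: each element is one whole character, so only
// genuine character matches are reported.
// (Hypothetical helper, for illustration only.)
bool contains_any_code_points (const std::u32string& text, const std::u32string& set)
{
    return text.find_first_of (set) != std::u32string::npos;
}
```

For example, U+00A9 is encoded as the bytes C2 A9 and U+00A0 as C2 A0. Searching "\xC2\xA9" for the set "\xC2\xA0" with contains_any_bytes reports a match because both share the lead byte C2, whereas contains_any_code_points (U"\u00A9", U"\u00A0") does not.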
An alternative, slightly radical, implementation of detect_Unicode() might be:
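std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);
    // Reject the string as soon as any character lies above U+00FF
    for (auto wide_char : ws)
    {
        if (wide_char > 0xff)
            return " ";
    }
    return s;
}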
But really, now you have a wide string to hand in detect_Unicode(), anything is possible, so go wild OP.
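As one example of what becomes easy once the string is wide, here is a rough sketch that removes unwanted characters instead of just detecting them. The function name strip_invisible and the choice of U+00A0 (no-break space) and U+200B (zero-width space) are assumptions for illustration, not something the OP specified.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: returns a copy of an already-widened string with
// every U+00A0 (no-break space) and U+200B (zero-width space) removed.
std::wstring strip_invisible (std::wstring ws)
{
    auto is_unwanted = [] (wchar_t c) { return c == 0x00A0 || c == 0x200B; };

    // Classic erase-remove idiom: shift the kept characters to the front,
    // then chop off the leftover tail.
    ws.erase (std::remove_if (ws.begin (), ws.end (), is_unwanted), ws.end ());
    return ws;
}
```

With the helpers from the answer, this could be used as narrow (strip_invisible (widen (s))).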
Other notes:
- The <codecvt> header (and with it std::codecvt_utf8) and std::wstring_convert are deprecated in C++17, but since the standard library offers no obvious replacement you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
- std::wstring might not be the best choice (wchar_t is only 16 bits wide on Windows, for example), but it's probably fine. You could also look at std::u16string and std::u32string.
Live demo.
Inspiration taken from here.
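If the deprecation ever becomes a problem, widen could be reimplemented without <codecvt>. The following is only a sketch under stated assumptions: the name widen_utf32 is made up, it returns a std::u32string rather than a std::wstring, and it checks for truncated input and malformed continuation bytes but does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical replacement for widen(): decodes UTF-8 by hand into UTF-32.
// Sketch only: overlong encodings and surrogates are NOT rejected.
std::u32string widen_utf32 (const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size (); )
    {
        const unsigned char lead = static_cast<unsigned char> (s[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that must follow

        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument ("invalid UTF-8 lead byte");

        if (i + extra >= s.size ())
            throw std::invalid_argument ("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char> (s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument ("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        out.push_back (cp);
        i += extra + 1;
    }
    return out;
}
```

Because char32_t is 32 bits wide on every platform, the result holds one element per code point, including characters outside the Basic Multilingual Plane.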