I\'m looking for suggestions regarding unicode aware std::string library replacements. I have a bunch of code that uses std::string, its iterators etc, and would like to now sup
I've written my own C++ UTF-8 library, which is a drop-in replacement of std::wstring
/string
. The data type that is showed to the user is char32_t
, but internally the wide characters are all packed into utf8 char
's.
The whole thing is quite fast and its performance is best with few unicode codepoints within many ascii codepoints. All operations that are known from std::string are available with this class (except for substring find
) and operate on codepoint indices, in contrast to byte indices.
As a bonus of defensive programming, the whole ANSI range of 0-255 can be used without multibytes :)
Hope this helps!
Asking for "a type like std::string, but for Unicode" is like asking for "a type like unsigned, but for primes." std::string is perfectly capable of storing Unicode, in many encodings - the most generally useful being UTF-8.
What you need to replace is your iterators, not your storage type. The iterators should iterate over the codepoints of the string rather than the bytes. That is, ++i
should advance one codepoint, and *i
should return a codepoint (via uint32_t) rather than a char
.
You don't need a complete rewrite if you make sure about what your std::string contains. For example, you could assume (and convert inputs to be sure) that your std::string contain UTF8 encoded strings (for those that need localization). Don't forget that std::string is only a container of raw data, it's not associated with an encoding (even in C++0x, it's only a possibility, not a requirement).
Then when you pass text to other libraries that require different encodings, you can use libraries like UTF8CPP to convert to the required encoding (but most of the time such libraries will do it themselves).
That way makes it simple. UTF8 with standard std::string in your code, enabling passing unicode string to everything else (with conversion if necessary).
There have been a lot of discussions about this in the boost community mailing list. Maybe reading it (if you have enough time...) can help you understand other possible solutions.
what unicode encoding do you need? If utf-8 is ok you can have a look at Glib::ustring
Glib::ustring has much the same interface as std::string, but contains Unicode characters encoded as UTF-8.
Depending on your needs, use std::wstring or the larger and more complex (but de facto standard) ICU: http://site.icu-project.org/