Unicode string normalization in C/C++


I am wondering how to normalize Unicode strings (UTF-8 or UTF-16) in C/C++. In .NET there is the String.Normalize method.

I used UTF8-CPP in the past but it does not prov

5 Answers
  • 2021-02-13 03:51

    A good UTF-8 solution is glib's g_utf8_normalize() function. If you also need this for std::wstring, you would first have to convert it to std::string (UTF-16 to UTF-8), which makes it quite an expensive solution. I am therefore still looking for a better one myself, ideally using only pure C++(11).
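The UTF-16 to UTF-8 step mentioned above can be sketched in standard C++11 without any library. This is only a minimal illustration: the helper name utf16_to_utf8 is hypothetical (it is not part of glib), and it assumes the input is held in a std::u16string rather than a std::wstring, because wchar_t is 16 bits wide only on some platforms such as Windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of glib): converts UTF-16 to UTF-8.
// Throws std::invalid_argument if the input contains an unpaired surrogate.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case: 3 bytes per UTF-16 code unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string could then be handed to a UTF-8 normalizer such as g_utf8_normalize(). The conversion itself is a single linear pass, so the extra cost is mainly the additional copy of the string.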

  • 2021-02-13 03:59

    As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization.

  • 2021-02-13 04:00

    You could build ICU with minimal data (or possibly no additional data at all; I think all of the normalization data is now internal) and then link it statically. I haven't tried this recently, but I believe the total size is pretty small in that case.

  • 2021-02-13 04:01

    "Lightweight" in your context means "with limited functionality". I would use the ICU source as an example and refer to http://unicode.org/reports/tr15/ to implement this "lightweight" functionality.
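To give a feel for what such a hand-rolled implementation involves, here is a toy sketch of canonical decomposition (NFD) as described in TR15: recursive decomposition followed by canonical reordering of combining marks. The names toy_nfd, kDecomp and combining_class are hypothetical, and the tables cover only a handful of code points chosen for illustration. A real implementation would generate them from the Unicode Character Database and would also handle Hangul syllables, which are decomposed algorithmically.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Illustrative excerpt only: a few canonical decomposition mappings.
// Real data has to come from the Unicode Character Database.
const std::map<char32_t, std::u32string> kDecomp = {
    {U'\u00C5', U"A\u030A"},       // A WITH RING ABOVE
    {U'\u00E9', U"e\u0301"},       // E WITH ACUTE
    {U'\u212B', U"\u00C5"},        // ANGSTROM SIGN (singleton mapping)
    {U'\u1E63', U"s\u0323"},       // S WITH DOT BELOW
    {U'\u1E69', U"\u1E63\u0307"},  // S WITH DOT BELOW AND DOT ABOVE
};

// Canonical combining class for the marks used above; 0 means "starter".
int combining_class(char32_t cp) {
    switch (cp) {
        case 0x0323: return 220;  // COMBINING DOT BELOW
        case 0x0301:              // COMBINING ACUTE ACCENT
        case 0x0307:              // COMBINING DOT ABOVE
        case 0x030A: return 230;  // COMBINING RING ABOVE
        default:     return 0;
    }
}

// Applies the decomposition mappings recursively until none is left.
void decompose(char32_t cp, std::u32string& out) {
    auto it = kDecomp.find(cp);
    if (it == kDecomp.end()) { out.push_back(cp); return; }
    for (char32_t c : it->second) decompose(c, out);
}

// Toy NFD: full canonical decomposition, then canonical ordering.
std::u32string toy_nfd(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) decompose(cp, out);

    // Stable-sort each run of non-starters by combining class.
    std::size_t i = 0;
    while (i < out.size()) {
        if (combining_class(out[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < out.size() && combining_class(out[j]) != 0) ++j;
        std::stable_sort(out.begin() + i, out.begin() + j,
                         [](char32_t a, char32_t b) {
                             return combining_class(a) < combining_class(b);
                         });
        i = j;
    }
    return out;
}
```

With these tables, toy_nfd(U"\u1E69") and toy_nfd(U"s\u0307\u0323") both yield the sequence s, U+0323, U+0307, which is the point of normalization: canonically equivalent inputs end up as the same code point sequence. Composition (NFC) and the compatibility forms need further tables and are left out here.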

  • 2021-02-13 04:07

    For Windows, there is the NormalizeString() function (unfortunately only available on Vista and later, as far as I can see on MSDN):

    http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx

    It's the simplest way to go that I have found so far. I guess it's quite lightweight too.

    int NormalizeString(
        _In_      NORM_FORM NormForm,
        _In_      LPCWSTR   lpSrcString,
        _In_      int       cwSrcLength,
        _Out_opt_ LPWSTR    lpDstString,
        _In_      int       cwDstLength
    );
    