Code to strip diacritical marks using ICU

醉话见心 2020-12-18 08:29

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., cha

2 Answers
  • 2020-12-18 08:53

    After more searching elsewhere:

    UErrorCode status = U_ZERO_ERROR;
    UnicodeString result;
    
    // 's16' is the UTF-16 string to have diacritics removed
    Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
    if ( U_FAILURE( status ) ) {
      // complain
    }
    
    // convert the normalized UTF-16 'result' (not 's16') to a UTF-8 std::string 's8'
    string s8;
    result.toUTF8String( s8 );
    
    string buf8;
    buf8.reserve( s8.length() );
    for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
      char const c = *i;
      if ( isascii( c ) )
        buf8.push_back( c );
    }
    // result is in buf8
    

    which is O(n).
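
    Note that the ASCII filter above also discards every non-ASCII character that is not a diacritic (Greek or Cyrillic letters, for example). If that matters, a narrower filter is possible. The following is only a sketch: `stripCombiningMarks` is a hypothetical helper, not an ICU function, and it assumes its input is valid UTF-8 that has already been decomposed (NFD or NFKD). It removes only code points in the Combining Diacritical Marks block (U+0300..U+036F), so marks outside that block are left in place.

    ```cpp
    #include <cstddef>
    #include <string>

    // Hypothetical helper (not part of ICU): removes code points in the
    // Combining Diacritical Marks block (U+0300..U+036F) from UTF-8 text.
    // Assumes 'decomposed' is valid UTF-8 that was already normalized to NFD/NFKD.
    std::string stripCombiningMarks(const std::string& decomposed) {
        std::string out;
        out.reserve(decomposed.size());
        for (std::size_t i = 0; i < decomposed.size(); ++i) {
            const unsigned char lead = static_cast<unsigned char>(decomposed[i]);
            if ((lead == 0xCC || lead == 0xCD) && i + 1 < decomposed.size()) {
                const unsigned char trail =
                    static_cast<unsigned char>(decomposed[i + 1]);
                // U+0300..U+033F encode as CC 80..CC BF,
                // U+0340..U+036F encode as CD 80..CD AF.
                const unsigned char last = (lead == 0xCC) ? 0xBF : 0xAF;
                if (trail >= 0x80 && trail <= last) {
                    ++i;  // skip the trail byte of the combining mark too
                    continue;
                }
            }
            out.push_back(decomposed[i]);
        }
        return out;
    }
    ```

    This is still a single pass over the bytes, so it stays O(n). Calling it on the 's8' string from the code above, in place of the isascii loop, would keep non-Latin letters while dropping the common accents.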

  • 2020-12-18 09:05

    ICU lets you transliterate a string using a specific rule. The rule I use is "NFD; [:M:] Remove; NFC": decompose, remove the marks (the [:M:] set, which includes diacritics), then recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:

    #include <memory>
    #include <string>
    
    #include <unicode/utypes.h>
    #include <unicode/unistr.h>
    #include <unicode/translit.h>
    
    std::string desaxUTF8(const std::string& str) {
        // UTF-8 std::string -> UTF-16 UnicodeString
        icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));
    
        // Create the transliterator; unique_ptr frees it on every return path
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Transliterator> accentsConverter(
            icu::Transliterator::createInstance(
                "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
        if (U_FAILURE(status) || !accentsConverter) {
            // One possible way to handle the error: return the input unchanged
            return str;
        }
    
        // Transliterate the UTF-16 UnicodeString in place
        accentsConverter->transliterate(source);
    
        // UTF-16 UnicodeString -> UTF-8 std::string
        std::string result;
        source.toUTF8String(result);
    
        return result;
    }
    