BreakIterator in Android counts characters incorrectly

Question

I am using BreakIterator to count the number of visible characters in a String. This works perfectly for English, but for Hindi it does not work as expected.

The String below has a length of 3, but visually it is a single character.

ज्य

When I use BreakIterator, I expect it to treat this String as a single unit, but it reports 2 units. Here is my code:

    final String text = "ज्य";
    final Locale locale = new Locale("hi","IN");
    final BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
    breaker.setText(text);
    int start = breaker.first();
    for (int end = breaker.next();
         end != BreakIterator.DONE;
         start = end, end = breaker.next()) {

        final String substring = text.substring(start, end);
    }

Ideally, the for loop should execute once, with start=0 and end=3. For the String above, however, it executes twice (start=0, end=2 and then start=2, end=3).

How can I get BreakIterator to report the correct boundaries?

UPDATE:

The code above works correctly when run as a plain Java program. It misbehaves only on Android.

Since this happens only on Android, I have filed a bug against Android: https://code.google.com/p/android/issues/detail?id=230832
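To compare the two platforms for a report like this, it can help to print the code points of the String next to the boundary offsets the iterator returns. The following is only a diagnostic sketch: the class name `BoundaryDump` and its helper methods are made up for illustration, and the offsets printed for the Hindi String will depend on the runtime, which is exactly what is being compared.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original question.
final class BoundaryDump {

    /** Returns every boundary offset reported by the character iterator, starting with 0. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    /** Formats the code points of the text as "U+XXXX U+XXXX ...". */
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u091C\u094D\u092F";
        System.out.println(codePoints(text));
        // The boundary list is runtime dependent; run this on both the JVM and Android.
        System.out.println(boundaries(text, new Locale("hi", "IN")));
    }
}
```

Running the same `main` on a desktop JVM and on an Android device should show whether the two iterators disagree on the boundary at offset 2.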

Answer 1:

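I think you need to write the String with Unicode escapes instead of literal characters, so that the source file encoding cannot change what the iterator sees.

See the Oracle documentation on character boundaries.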

    final String text = "\u091C\u094D\u092F";
    final Locale locale = new Locale("hi","IN");
    final BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
    breaker.setText(text);
    int start = breaker.first();
    for (int end = breaker.next();
         end != BreakIterator.DONE;
         start = end, end = breaker.next()) {

        final String substring = text.substring(start, end);
        System.out.println(substring);
    }
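If the iterator on a given runtime still breaks after the virama (U+094D), one possible workaround is to ignore any boundary that falls between a virama and a following Devanagari letter. This is a heuristic sketch, not a fix for BreakIterator itself: the class `ViramaAwareCounter` and the method `countVisible` are hypothetical names, the rule covers Devanagari only, and it does not handle cases such as an explicit ZWNJ after the virama.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical workaround, assuming the only wrong boundaries are those after a virama.
final class ViramaAwareCounter {

    private static final char DEVANAGARI_VIRAMA = '\u094D';

    /** Counts character-iterator units, skipping boundaries that would split a conjunct. */
    static int countVisible(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            if (!splitsConjunct(text, end)) {
                count++;
            }
        }
        return count;
    }

    /** True when the boundary sits between a virama and a following Devanagari letter. */
    private static boolean splitsConjunct(String text, int boundary) {
        if (boundary <= 0 || boundary >= text.length()) {
            return false;
        }
        char next = text.charAt(boundary);
        return text.charAt(boundary - 1) == DEVANAGARI_VIRAMA
                && Character.isLetter(next)
                && Character.UnicodeBlock.of(next) == Character.UnicodeBlock.DEVANAGARI;
    }

    public static void main(String[] args) {
        Locale hindi = new Locale("hi", "IN");
        System.out.println(countVisible("\u091C\u094D\u092F", hindi));
    }
}
```

Because the check only suppresses boundaries, it gives the same count whether the underlying iterator reports the conjunct as one unit or as two.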

Source: https://stackoverflow.com/questions/41270091/breakiterator-in-android-counts-character-wrongly

Tags

java

android

internationalization

hindi

icu4j