Ignoring Hebrew vowels when comparing strings

Question

Good evening, I hope you can help me with this problem, as I'm struggling to find a solution.

I have a provider of words that gives me vowelled Hebrew words, for example:

Vowelled - בַּיִת not vowelled - בית

Vowelled - הַבַּיְתָה not vowelled - הביתה

Unlike my provider, my user normally can't enter Hebrew vowels (nor would I want him to). The user story is a user searching for a word among the provided words. The problem is comparing the vowelled and un-vowelled words: since each is represented by a different byte array in memory, the equals method returns false.

I tried looking into how UTF-8 handles Hebrew vowels, and it seems they are just normal characters.

I do want to present the vowels to the user, so I want to keep the string as-is in memory, but ignore the vowels when comparing. Is there a simple way to solve this problem?

Answer 1:

You can use a Collator. I can't tell you exactly how it works, as it's new to me, but this appears to do the trick:

import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main( String[] args ) {
        String withVowels = "בַּיִת";
        String withoutVowels = "בית";

        String withVowelsTwo = "הַבַּיְתָה";
        String withoutVowelsTwo = "הביתה";

        // String.equals compares the characters one by one, so the vowel points make the strings differ.
        System.out.println( "These two strings are " + (withVowels.equals( withoutVowels ) ? "" : "not ") + "equal" );
        System.out.println( "The second two strings are " + (withVowelsTwo.equals( withoutVowelsTwo ) ? "" : "not ") + "equal" );

        // A Hebrew collator at PRIMARY strength compares base letters only.
        Collator collator = Collator.getInstance( new Locale( "he" ) );
        collator.setStrength( Collator.PRIMARY );

        System.out.println( collator.equals( withVowels, withoutVowels ) );
        System.out.println( collator.equals( withVowelsTwo, withoutVowelsTwo ) );
    }
}

From that, I get the following output:

These two strings are not equal
The second two strings are not equal
true
true
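
For the user story in the question (looking up a typed word among the provided words), the same collator could be applied across a whole list. This is only a minimal sketch: the class name VowelInsensitiveLookup and the helper findMatches are hypothetical, and it assumes the provided words are held in a List<String> and that the Hebrew collation data is available in the runtime.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class VowelInsensitiveLookup {

    // Hypothetical helper: returns every provided word that the Hebrew
    // collator considers equal to the query at PRIMARY strength.
    static List<String> findMatches(List<String> providedWords, String query) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("he"));
        collator.setStrength(Collator.PRIMARY);

        List<String> matches = new ArrayList<>();
        for (String word : providedWords) {
            if (collator.equals(word, query)) {
                // Keep the original vowelled string so it can be shown to the user.
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> provided = Arrays.asList("בַּיִת", "הַבַּיְתָה");
        System.out.println(findMatches(provided, "בית"));
    }
}
```

This scans the list linearly on every lookup, which is fine for small word lists; the stored strings themselves are never modified.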

Answer 2:

As far as I know, there isn't. Vowels are characters in their own right, and some combinations of letters and dots are even encoded as single characters. See the Wikipedia page:

http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

You can store a search key for each word that contains only the characters in the 05Dx-05Ex range (the Hebrew letters, U+05D0 to U+05EA), and add another field that holds the word with its vowels.

Of course, you should expect the following:

You need to account for words whose meaning differs depending on the nikkud.
You should take into account "misspellings" involving י and ו, which are commonplace.
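
The search-key idea above could be sketched roughly as follows. The class name HebrewSearchKey and the method toSearchKey are hypothetical. The NFKD normalization step is an added assumption, not part of the answer: it is meant to split the precomposed letter-plus-dot characters into a base letter and a separate mark before filtering. Each key maps to a list of words because different vowelled words can share the same letters.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HebrewSearchKey {

    // Matches anything outside the Hebrew letter range U+05D0..U+05EA.
    private static final String NON_LETTERS = "[^\\u05D0-\\u05EA]";

    // Hypothetical helper: builds the vowel-free key used for searching.
    static String toSearchKey(String word) {
        // Decompose presentation forms (e.g. a letter with dagesh stored as
        // one code point) so the base letter survives the filter below.
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFKD);
        // Drop nikkud, cantillation marks and every other non-letter character.
        return decomposed.replaceAll(NON_LETTERS, "");
    }

    public static void main(String[] args) {
        // key -> all vowelled words that reduce to that key
        Map<String, List<String>> index = new HashMap<>();
        for (String vowelled : new String[] { "בַּיִת", "הַבַּיְתָה" }) {
            index.computeIfAbsent(toSearchKey(vowelled), k -> new ArrayList<>())
                 .add(vowelled);
        }

        // The user's input goes through the same function before the lookup.
        System.out.println(index.get(toSearchKey("בית")));
    }
}
```

Note that this filter also removes spaces, digits and Latin characters, so it only suits single Hebrew words. It does nothing about the י/ו spelling variants mentioned above.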

Source: https://stackoverflow.com/questions/12763476/ignoring-hebrew-vowels-when-comparing-strings

Tags

java

encoding

hebrew