Ignoring Hebrew vowels when comparing strings

Good evening, I hope you can help me with this problem, as I'm struggling to find a solution.

I have a provider of words that gives me vowelled Hebrew words, for example:

Vowelled - בַּיִת not vowelled - בית

Vowelled - הַבַּיְתָה not vowelled - הביתה

Unlike my provider, my user normally can't enter Hebrew vowels (nor would I want them to). The user story is a user searching for a word among the provided words. The problem is comparing the vowelled and unvowelled words: since each is represented by a different byte array in memory, the equals method returns false.

I tried looking into how UTF-8 handles Hebrew vowels, and it seems they are just normal characters.

I do want to show the vowels to the user, so I want to keep the string as-is in memory, but when comparing I want to ignore them. Is there a simple way to solve this problem?

You can use a Collator. I can't tell you exactly how it works, as it's new to me, but this appears to do the trick:

import java.text.Collator;
import java.util.Locale;

public class CompareIgnoringVowels {
    public static void main( String[] args ) {
        String withVowels = "בַּיִת";
        String withoutVowels = "בית";

        String withVowelsTwo = "הַבַּיְתָה";
        String withoutVowelsTwo = "הביתה";

        // String.equals compares character by character, so the vowel points make the strings differ.
        System.out.println( "These two strings are " + (withVowels.equals( withoutVowels ) ? "" : "not ") + "equal" );
        System.out.println( "The second two strings are " + (withVowelsTwo.equals( withoutVowelsTwo ) ? "" : "not ") + "equal" );

        // At PRIMARY strength the collator only looks at base letters.
        Collator collator = Collator.getInstance( new Locale( "he" ) );
        collator.setStrength( Collator.PRIMARY );

        System.out.println( collator.equals( withVowels, withoutVowels ) );
        System.out.println( collator.equals( withVowelsTwo, withoutVowelsTwo ) );
    }
}

From that, I get the following output:

These two strings are not equal
The second two strings are not equal
true
true
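Since the user story is looking a word up among many provided words, comparing pair by pair may get slow. Here is a rough sketch of indexing the vowelled words by their CollationKey, so that a lookup is a single map access. The class name VowelInsensitiveIndex and its find method are made up for illustration, and the sketch relies on the "he" collator treating the vowel points as ignorable at PRIMARY strength, as the output above suggests; I have not checked the rule tables myself.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a vowel-insensitive key to the original vowelled word.
final class VowelInsensitiveIndex {
    private final Collator collator;
    private final Map<CollationKey, String> byKey = new HashMap<>();

    VowelInsensitiveIndex(List<String> vowelledWords) {
        collator = Collator.getInstance(Locale.forLanguageTag("he"));
        // PRIMARY strength: only base-letter differences count.
        collator.setStrength(Collator.PRIMARY);
        for (String word : vowelledWords) {
            // Words that differ only in their vowels share a key; the last one wins here.
            byKey.put(collator.getCollationKey(word), word);
        }
    }

    /** Returns the stored vowelled word matching the input, or null if there is none. */
    String find(String userInput) {
        return byKey.get(collator.getCollationKey(userInput));
    }
}
```

Note that the map keeps only one word per key, so if several vowelled words share the same letters you would probably want a Map from key to a list of words instead.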

AFAIK there isn't. Vowels are characters in their own right (combining marks), and some combinations of a letter and a dot also exist as single precomposed characters. See the Wikipedia page:

http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

You can store the search key for your words using only characters in the 05Dx-05Ex range (the Hebrew letters, U+05D0 to U+05EA). You can add another field for the word with the vowels.

Of course, you should expect the following:

You will need to account for words that have different meanings depending on nikkud.
You should take into account "misspellings" of י and ו, which are commonplace.
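The search-key idea above could be sketched roughly like this. It is only one way to do it, not something from the Wikipedia page: the class name NikkudStripper and the method searchKey are invented for the example, and instead of whitelisting the letter range it removes every non-spacing mark after NFD normalization, which also takes care of the precomposed letter-plus-dot characters.

```java
import java.text.Normalizer;

// Hypothetical helper for building a vowel-free search key.
final class NikkudStripper {
    private NikkudStripper() {}

    /** Returns the word with all non-spacing marks (vowel points, cantillation) removed. */
    static String searchKey(String word) {
        // NFD splits precomposed forms (for example U+FB2A, shin with shin dot)
        // into a base letter followed by a combining mark.
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        // \p{Mn} matches non-spacing marks, which is the category of the Hebrew points.
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        // The first example word from the question, written with escapes.
        String vowelled = "\u05D1\u05B7\u05BC\u05D9\u05B4\u05EA";
        String plain = "\u05D1\u05D9\u05EA";
        System.out.println(searchKey(vowelled).equals(plain)); // prints true
    }
}
```

You would store searchKey(word) next to the original vowelled word and run the user's input through the same method before comparing. This does nothing about the two caveats above: words that differ only in nikkud end up with the same key, and spelling variants with or without י and ו still won't match.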

Source: https://stackoverflow.com/questions/12763476/ignoring-hebrew-vowels-when-comparing-strings
