Ignoring diacritic characters when comparing words with special characters (é, è, …)

梦如初夏 2021-02-05 18:08

I have a list of Belgian cities with diacritic characters (Liège, Quiévrain, Franière, etc.) and I would like to transform these special characters to compare with a lis

8 Answers
  • 2021-02-05 18:29

    For those looking for a clean Java solution, use Apache Commons Lang:

    StringUtils.stripAccents("Liège").toUpperCase();
    

    This returns:

    LIEGE
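
    If adding Commons Lang is not an option, a similar result can be sketched with the JDK's own java.text.Normalizer: decompose each character (NFD) and drop the combining marks. The class and method names below (AccentStripper, stripAccents) are made up for illustration, and this is only an approximation of what the library does, not its implementation.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentStripper {

    /**
     * Decomposes accented characters (NFD) and removes the combining marks.
     * Hypothetical helper; not the Commons Lang implementation.
     */
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Li\u00e8ge" is Liege written with a grave accent on the first e.
        // Locale.ROOT keeps the upper-casing independent of the default locale.
        System.out.println(stripAccents("Li\u00e8ge").toUpperCase(Locale.ROOT)); // prints LIEGE
    }
}
```

    Letters that have no canonical decomposition, such as Ł, pass through this sketch unchanged.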
    
  • 2021-02-05 18:30

    The Collator class is a good way to do it (see the corresponding Javadoc). Here is a unit test that shows how to use it:

    import static org.junit.Assert.assertEquals;
    
    import java.text.Collator;
    import java.util.Locale;
    
    import org.junit.Test;
    
    public class CollatorTest {
        @Test public void liege() throws Exception {
            Collator compareOperator = Collator.getInstance(Locale.FRENCH);
            compareOperator.setStrength(Collator.PRIMARY);
    
            assertEquals(0, compareOperator.compare("Liege", "Liege")); // no accent
            assertEquals(0, compareOperator.compare("Liège", "Liege")); // with accent
            assertEquals(0, compareOperator.compare("LIEGE", "Liege")); // case insensitive
            assertEquals(0, compareOperator.compare("LIEGE", "Liège")); // case insensitive with accent
    
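            // compare() only guarantees the sign of the result, so check the sign rather than exactly 1 or -1
            assertEquals(1, Integer.signum(compareOperator.compare("Liege", "Bruxelles")));
            assertEquals(-1, Integer.signum(compareOperator.compare("Bruxelles", "Liege")));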
        }
    }
    

    EDIT: Sorry to see my answer did not meet your needs; maybe it's because I presented it as a unit test? Is this OK for you? I personally find it better because it's short and uses only the SDK (no need for string replacement):

    Collator compareOperator = Collator.getInstance(Locale.FRENCH);
    compareOperator.setStrength(Collator.PRIMARY);
    if (compareOperator.compare("Liège", "Liege") == 0) {
        // if we are here, then it's the "same" String
    }
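
    Since the question is about comparing against a list, the same collator can be applied to each entry in turn. This is only a sketch: the class and method names (CityMatcher, findMatch) and the sample data are assumptions for illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CityMatcher {

    // PRIMARY strength ignores both accent and case differences.
    private static final Collator COLLATOR = Collator.getInstance(Locale.FRENCH);

    static {
        COLLATOR.setStrength(Collator.PRIMARY);
    }

    /** Returns the first city that matches the query, or null if none does. */
    public static String findMatch(List<String> cities, String query) {
        for (String city : cities) {
            if (COLLATOR.compare(city, query) == 0) {
                return city;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample data: the three cities from the question, written with Unicode escapes.
        List<String> cities = Arrays.asList("Li\u00e8ge", "Qui\u00e9vrain", "Frani\u00e8re");
        System.out.println(findMatch(cities, "QUIEVRAIN"));
    }
}
```

    A linear scan is fine for a short list; for larger ones, a TreeSet built with the same Collator as its comparator would avoid comparing against every entry.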
    

    Hope this helps.

  • 2021-02-05 18:31

    If you still need this for Android API 8 or lower (Android 2.2, Java 1.5), where the Normalizer class is not available, here is my code, which I think is easier to modify than Pentium10's answer:

    import java.util.HashMap;
    
    public class StringAccentRemover {
    
        @SuppressWarnings("serial")
        private static final HashMap<Character, Character> accents  = new HashMap<Character, Character>(){
            {
                put('Ą', 'A');
                put('Ę', 'E');
                put('Ć', 'C');
                put('Ł', 'L');
                put('Ń', 'N');
                put('Ó', 'O');
                put('Ś', 'S');
                put('Ż', 'Z');
                put('Ź', 'Z');
    
                put('ą', 'a');
                put('ę', 'e');
                put('ć', 'c');
                put('ł', 'l');
                put('ń', 'n');
                put('ó', 'o');
                put('ś', 's');
                put('ż', 'z');
                put('ź', 'z');
            }
        };
        /**
         * Removes accents from a string, replacing each accented character
         * in the table above with its ASCII equivalent.
         */
        public static String removeAccents(String s) {
            char[] result = s.toCharArray();
            for(int i=0; i<result.length; i++) {
                Character replacement = accents.get(result[i]);
                if (replacement!=null) result[i] = replacement;
            }
            return new String(result);
        }
    
    }
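
    The table above only covers Polish letters. As an illustration of how the same approach might be adapted to the French place names in the question, here is a sketch with a different table; the class name and the chosen set of letters are assumptions, and the list is not exhaustive.

```java
import java.util.HashMap;
import java.util.Map;

public class FrenchAccentRemover {

    // Hypothetical table: lower-case accented letters (a-grave, a-circumflex, c-cedilla,
    // e-grave, e-acute, ...) paired position by position with their plain equivalents.
    private static final String ACCENTED =
            "\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc";
    private static final String PLAIN = "aaceeeeiiouuu";

    private static final Map<Character, Character> ACCENTS = new HashMap<Character, Character>();

    static {
        for (int i = 0; i < ACCENTED.length(); i++) {
            char accented = ACCENTED.charAt(i);
            char plain = PLAIN.charAt(i);
            ACCENTS.put(accented, plain);
            // Register the upper-case pair as well.
            ACCENTS.put(Character.toUpperCase(accented), Character.toUpperCase(plain));
        }
    }

    /** Replaces every character found in the table; all others are kept as-is. */
    public static String removeAccents(String s) {
        char[] result = s.toCharArray();
        for (int i = 0; i < result.length; i++) {
            Character replacement = ACCENTS.get(result[i]);
            if (replacement != null) {
                result[i] = replacement;
            }
        }
        return new String(result);
    }
}
```

    With this table, removeAccents("Quiévrain").equalsIgnoreCase("QUIEVRAIN") would be true, so the comparison from the question reduces to an ordinary string comparison.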
    
  • 2021-02-05 18:32

    Since the Normalizer class is not supported in Froyo or earlier Android versions, I combined this and this (both of which I upvoted) and optimized the result into a couple of helper methods. The unaccentify method simply converts diacritic characters to plain characters, while slugify generates a slug for the input string. I hope it is useful to someone. Here is the source code:

    import java.util.Arrays;
    import java.util.Locale;  
    import java.util.regex.Pattern;  
    
    public class SlugFroyo {
        private static final Pattern STRANGE = Pattern.compile("[^a-zA-Z0-9-]");
        private static final Pattern WHITESPACE = Pattern.compile("[\\s]");
    
        private static final String DIACRITIC_CHARS = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
                + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
                + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
                + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
                + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
                + "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";
    
        private static final String PLAIN_CHARS = "AaEeIiOoUu" // grave
                + "AaEeIiOoUuYy" // acute
                + "AaEeIiOoUuYy" // circumflex
                + "AaOoNn" // tilde
                + "AaEeIiOoUuYy" // umlaut
                + "Aa" // ring
                + "Cc" // cedilla
                + "OoUu"; // double acute
    
        private static char[] lookup = new char[0x180];
    
        static {
            Arrays.fill(lookup, (char) 0);
            for (int i = 0; i < DIACRITIC_CHARS.length(); i++)
                lookup[DIACRITIC_CHARS.charAt(i)] = PLAIN_CHARS.charAt(i);
        }
    
        public static String slugify(String s) {
            String nowhitespace = WHITESPACE.matcher(s).replaceAll("-");
            String unaccented = unaccentify(nowhitespace);
            String slug = STRANGE.matcher(unaccented).replaceAll("");
            return slug.toLowerCase(Locale.ENGLISH);
        }
    
        public static String unaccentify(String s) {
            StringBuilder sb = new StringBuilder(s);
            for (int i = 0; i < sb.length(); i++) {
                char c = sb.charAt(i);
                if (c > 126 && c < lookup.length) {
                    char replacement = lookup[c];
                    if (replacement > 0)
                        sb.setCharAt(i, replacement);
                }
            }
            return sb.toString();
        }
    }
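
    For comparison, on platforms where java.text.Normalizer is available, the same three slug steps (whitespace to dashes, accents removed, everything else stripped) can be sketched without a lookup table. The class name SlugSketch is hypothetical, and this variant is not meant for Froyo, where Normalizer is missing.

```java
import java.text.Normalizer;
import java.util.Locale;

public class SlugSketch {

    /** Builds a lower-case, dash-separated, ASCII-only slug from the input. */
    public static String slugify(String s) {
        // 1. Replace each whitespace character with a dash.
        String dashed = s.replaceAll("\\s", "-");
        // 2. Decompose accented characters and drop the combining marks.
        String unaccented = Normalizer.normalize(dashed, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        // 3. Remove anything that is not an ASCII letter, a digit, or a dash.
        String slug = unaccented.replaceAll("[^a-zA-Z0-9-]", "");
        return slug.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // "Li\u00e8ge Guillemins" is "Liege Guillemins" with a grave accent on the first e.
        System.out.println(slugify("Li\u00e8ge Guillemins")); // prints liege-guillemins
    }
}
```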
    
  • 2021-02-05 18:34

    I don't know if it is available on Android, but on the JVM you should not reimplement this in your project; reuse existing code instead: just use org.apache.commons.lang3.StringUtils#stripAccents.

  • 2021-02-05 18:38

    Check out this method in Java:

    private static final String PLAIN_ASCII = "AaEeIiOoUu" // grave
                + "AaEeIiOoUuYy" // acute
                + "AaEeIiOoUuYy" // circumflex
                + "AaOoNn" // tilde
                + "AaEeIiOoUuYy" // umlaut
                + "Aa" // ring
                + "Cc" // cedilla
                + "OoUu" // double acute
        ;
    
        private static final String UNICODE = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
                + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
                + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
                + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
                + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
                + "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";
    
        /**
         * Removes accents from a string, replacing each character listed in
         * UNICODE with the ASCII character at the same position in PLAIN_ASCII.
         */
        public static String removeAccents(String s) {
            if (s == null)
                return null;
            StringBuilder sb = new StringBuilder(s.length());
            int n = s.length();
            int pos = -1;
            char c;
            boolean found = false;
            for (int i = 0; i < n; i++) {
                pos = -1;
                c = s.charAt(i);
                pos = (c <= 126) ? -1 : UNICODE.indexOf(c);
                if (pos > -1) {
                    found = true;
                    sb.append(PLAIN_ASCII.charAt(pos));
                } else {
                    sb.append(c);
                }
            }
            if (!found) {
                return s;
            } else {
                return sb.toString();
            }
        }
    