Remove all non-“word characters” from a String in Java, leaving accented characters?

被撕碎了的回忆 2020-11-28 02:18

Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters".

        "TESTÜTEST".replaceAll( "
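
The call above is cut off in this copy of the question. As a minimal sketch of the behavior being described, assuming the pattern was \W (an assumption, not recovered from the original), the default definition of \W in java.util.regex is ASCII-only, so the umlaut is stripped along with real punctuation. The class and method names are hypothetical.

```java
class AsciiWordClassDemo {
    // By default \w is [a-zA-Z_0-9], so \W matches everything else,
    // including accented letters. The pattern here is an assumption.
    static String stripNonWordAscii(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        // "TEST\u00DCTEST" is TESTÜTEST written with a Unicode escape
        System.out.println(stripNonWordAscii("TEST\u00DCTEST")); // prints TESTTEST
    }
}
```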


        
5 Answers
  • 2020-11-28 02:48

    I was trying to achieve the exact opposite when I stumbled on this thread. I know it's quite old, but here's my solution nonetheless. You can match on Unicode blocks. In this case, compile the following code (with the right imports):

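    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "äêìóblah";
    // \p{InLatin-1Supplement} matches any character in the Latin-1 Supplement block (U+0080 to U+00FF)
    Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+");
    Matcher m = p.matcher(s);
    System.out.println(m.find());          // is there at least one such character?
    System.out.println(m.replaceAll("#")); // replace each run of them with "#"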
    

    You should see the following output:

    true

    #blah

    Best,

  • 2020-11-28 02:55

    Well, here is one solution I ended up with, but I hope there's a more elegant one...

    // name is the input String to filter
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
        char tmpChar = name.charAt(i);
        // keep Unicode letters and digits plus the underscore
        if (Character.isLetterOrDigit(tmpChar) || tmpChar == '_') {
            result.append(tmpChar);
        }
    }
    

    result.toString() then holds the filtered string.
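
    The same loop can be written as a self-contained method. This is only a sketch: the class and method names are hypothetical, and it iterates by code point rather than by char so that letters outside the Basic Multilingual Plane are not split into surrogate halves. It assumes Java 8 or later.

```java
class LetterOrDigitFilter {
    // Keeps Unicode letters, digits and '_' and drops everything else.
    static String keepWordChars(String name) {
        StringBuilder result = new StringBuilder(name.length());
        name.codePoints()
            .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
            .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // "TEST\u00DC-TEST_1!" is TESTÜ-TEST_1! written with a Unicode escape
        System.out.println(keepWordChars("TEST\u00DC-TEST_1!")); // TESTÜTEST_1
    }
}
```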

  • 2020-11-28 02:55

    You might want to remove the accents and diacritical marks first, then check at each character position whether the "simplified" string holds an ASCII letter: if it does, the original character at that position is a word character; if not, it can be removed.
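
    One way this idea could look in code is sketched below. It simplifies each character on its own (so positions stay aligned) by decomposing it with java.text.Normalizer and dropping the combining marks. The class and method names are hypothetical, and accepting digits and the underscore in addition to ASCII letters is an assumption made to match the usual notion of a word character.

```java
import java.text.Normalizer;

class AccentAwareFilter {
    // Keeps a character if its accent-free form is an ASCII letter, digit or '_'.
    static String keepWordChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            // NFD splits an accented letter into base letter + combining marks;
            // removing \p{M} (marks) leaves only the base letter.
            String simplified = Normalizer.normalize(original, Normalizer.Form.NFD)
                                          .replaceAll("\\p{M}+", "");
            if (simplified.matches("[A-Za-z0-9_]+")) {
                out.append(original); // append the original, accent included
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "a-\u00E9!" is a-é! written with a Unicode escape
        System.out.println(keepWordChars("a-\u00E9!")); // aé
    }
}
```

    Letters that have no canonical decomposition, such as ß or ø, do not simplify to ASCII and are therefore dropped by this check.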

  • 2020-11-28 02:56

    Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

    In Java:

    String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    

    Edit:

    I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
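
    To make the difference between the two categories concrete, here is a small sketch that runs both patterns on the same input. The class and method names are hypothetical; the patterns are the ones from this answer.

```java
class UnicodeCategoryDemo {
    // Removes everything except letters and decimal digits (category Nd).
    static String keepLettersAndDecimalDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Removes everything except letters and any kind of number (category N).
    static String keepLettersAndNumbers(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    public static void main(String[] args) {
        // Input: TESTÜ, space, 1, ¼ (U+00BC), hyphen, Arabic-Indic digit three (U+0663), underscore
        String input = "TEST\u00DC 1\u00BC-\u0663_";
        System.out.println(keepLettersAndDecimalDigits(input)); // TESTÜ1٣
        System.out.println(keepLettersAndNumbers(input));       // TESTÜ1¼٣
    }
}
```

    Note that the underscore is neither a letter nor a digit, so both patterns remove it, unlike \w.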

  • 2020-11-28 03:01

    At times you do not want to remove the characters outright, but only strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    
    import org.apache.commons.lang.StringUtils;
    
    /**
     * Utility class for String manipulation.
     * 
     * @author Stefan Haberl
     */
    public abstract class TextUtils {
        private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
        private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
                "sz" };
    
        /**
         * Normalizes a String by removing all accents to original 127 US-ASCII
         * characters. This method handles German umlauts and "sharp-s" correctly
         * 
         * @param s
         *            The String to normalize
         * @return The normalized String
         */
        public static String normalize(String s) {
            if (s == null)
                return null;
    
            String n = null;
    
            n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
            n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    
            return n;
        }
    
        /**
         * Returns a clean representation of a String which might be used safely
         * within an URL. Slugs are a more human friendly form of URL encoding a
         * String.
         * <p>
         * The method first normalizes a String, then converts it to lowercase and
         * removes ASCII characters, which might be problematic in URLs:
         * <ul>
         * <li>all whitespaces
         * <li>dots ('.')
         * <li>(semi-)colons (';' and ':')
         * <li>equals ('=')
         * <li>ampersands ('&')
         * <li>slashes ('/')
         * <li>angle brackets ('<' and '>')
         * </ul>
         * 
         * @param s
         *            The String to slugify
         * @return The slugified String
         * @see #normalize(String)
         */
        public static String slugify(String s) {
    
            if (s == null)
                return null;
    
            String n = normalize(s);
            n = StringUtils.lowerCase(n);
            n = n.replaceAll("[\\s.:;&=<>/]", "");
    
            return n;
        }
    }
    

    Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
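
    For readers who do not have Commons Lang on the classpath, the same two steps can be sketched with the JDK alone. This is not the class above but a hypothetical stand-in (SlugSketch and its table are made-up names) that swaps StringUtils.replaceEachRepeatedly for plain String.replace calls and assumes the same replacement table.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugSketch {
    // German special cases, written as Unicode escapes to avoid source-encoding issues.
    private static final String[][] GERMAN = {
        {"\u00C4", "Ae"}, {"\u00E4", "ae"},   // Ä, ä
        {"\u00D6", "Oe"}, {"\u00F6", "oe"},   // Ö, ö
        {"\u00DC", "Ue"}, {"\u00FC", "ue"},   // Ü, ü
        {"\u00DF", "sz"}                      // ß
    };

    static String normalize(String s) {
        if (s == null) return null;
        String n = s;
        for (String[] pair : GERMAN) {
            n = n.replace(pair[0], pair[1]);
        }
        // Decompose remaining accented letters, then drop every non-ASCII character
        // (which removes the combining marks left behind by NFD).
        return Normalizer.normalize(n, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    static String slugify(String s) {
        if (s == null) return null;
        return normalize(s).toLowerCase(Locale.ROOT).replaceAll("[\\s.:;&=<>/]", "");
    }

    public static void main(String[] args) {
        // Input is "Über Größe: Café"
        System.out.println(slugify("\u00DCber Gr\u00F6\u00DFe: Caf\u00E9")); // uebergroeszecafe
    }
}
```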

    HTH

    EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
