Replace multiple consecutive occurrences of a character with a single occurrence

被撕碎了的回忆 2021-01-07 14:47

I am building a natural language processing application in Java, using data from IMDB and Amazon.

I came across a certain dataset which has words like

5 Answers
  • 2021-01-07 15:18

    There are no English words that I know of that have more than two consecutive identical letters.

    1. Iterate over all words
    2. If the word has more than two consecutive identical letters, then:
      • Remove all but two of the duplicate letters, and see if a valid word is formed.
      • Otherwise, remove all but one duplicate letter, and see if a valid word is formed.
      • Otherwise, fail.

    This approach would not catch:

    • "partyy" (only two consecutive identical letters, so step 2 never triggers)
    • "stoop" (which is also ambiguous: is it "stop" with an extra "o", or simply "stoop"?)
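
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the class name `RepeatedLetterNormalizer`, the method `normalize`, and the tiny `dictionary` set are all hypothetical, and a real application would load a proper word list. It assumes Java 9+ for `Set.of` and reduces all long runs in a word together rather than trying every combination.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

class RepeatedLetterNormalizer {
    // Three or more of the same lowercase letter in a row.
    private static final Pattern LONG_RUN = Pattern.compile("([a-z])\\1{2,}");

    /**
     * Returns the normalized word, or Optional.empty() if neither reduction
     * yields a dictionary word. The dictionary is a hypothetical set of
     * valid lowercase words supplied by the caller.
     */
    static Optional<String> normalize(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (!LONG_RUN.matcher(lower).find()) {
            return Optional.of(lower); // no run longer than two: left untouched
        }
        // First try keeping two of each over-long run ("gooood" -> "good").
        String keepTwo = LONG_RUN.matcher(lower).replaceAll("$1$1");
        if (dictionary.contains(keepTwo)) {
            return Optional.of(keepTwo);
        }
        // Otherwise keep only one ("partyyyy" -> "party").
        String keepOne = LONG_RUN.matcher(lower).replaceAll("$1");
        if (dictionary.contains(keepOne)) {
            return Optional.of(keepOne);
        }
        return Optional.empty(); // fail
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("stop", "stoop", "party", "good");
        System.out.println(normalize("gooood", dictionary));   // Optional[good]
        System.out.println(normalize("partyyyy", dictionary)); // Optional[party]
        System.out.println(normalize("stooooop", dictionary)); // Optional[stoop]
        System.out.println(normalize("partyy", dictionary));   // Optional[partyy]
    }
}
```

The last two calls show the limitations listed above: "stooooop" resolves to "stoop" because the two-letter form is tried first, and "partyy" passes through unchanged because it never has more than two identical letters in a row.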

  • 2021-01-07 15:22

    You can use a regex to find any letter that is followed by at least two more copies of itself. Requiring three or more in a row means correctly doubled letters, like the "m" in "comma", are left alone.

    String data="stoooooop partyyyyyy";
    System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
    //                                       |      |         |
    //                                   group 1   match    replace with 
    //                                             from     match from group 1
    //                                             group 1
    //                                             repeated 
    //                                           twice or more
    

    Output:

    stop party
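
As a small illustration of what this pattern does and does not touch, here is a sketch with two hypothetical helpers, `collapseToOne` (the replacement used above) and `collapseToTwo` (an assumed variant that keeps a doubled letter). Neither name comes from the answer.

```java
class RegexCollapseDemo {
    // Same pattern as above: a letter followed by two or more copies of itself.
    private static final String LONG_RUN = "([a-zA-Z])\\1{2,}";

    // Replaces each run of three or more identical letters with one letter.
    static String collapseToOne(String text) {
        return text.replaceAll(LONG_RUN, "$1");
    }

    // Assumed variant: replaces each such run with two letters instead.
    static String collapseToTwo(String text) {
        return text.replaceAll(LONG_RUN, "$1$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseToOne("comma partyy")); // comma partyy (doubles untouched)
        System.out.println(collapseToOne("goooood"));      // god
        System.out.println(collapseToTwo("goooood"));      // good
        System.out.println(collapseToTwo("stoooooop"));    // stoop
    }
}
```

Neither replacement is right for every word ("goooood" wants two letters, "stoooooop" probably wants one), so the regex alone cannot resolve the ambiguity without some kind of dictionary check.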
    
  • 2021-01-07 15:28

    You can use this snippet; it is a fairly fast implementation.

    public static String removeConsecutiveChars(String str) {
        if (str == null) {
            return null;
        }

        int strLen = str.length();
        if (strLen <= 1) {
            return str;
        }

        char[] strChar = str.toCharArray();
        char temp = strChar[0]; // character of the run currently being scanned

        StringBuilder stringBuilder = new StringBuilder(strLen);
        for (int i = 1; i < strLen; i++) {
            char val = strChar[i];
            if (val != temp) {
                // The run has ended: emit its character once, then start a new run.
                stringBuilder.append(temp);
                temp = val;
            }
        }
        stringBuilder.append(temp); // emit the final run

        return stringBuilder.toString();
    }
    
  • 2021-01-07 15:28

    Try using a loop:

    String word = "Stoooppppd";
    StringBuilder res = new StringBuilder();
    char previous = word.charAt(0); // assumes word is non-empty
    res.append(previous);
    for (int i = 1; i < word.length(); i++) {
        char ch = word.charAt(i);
        if (ch != previous) {
            res.append(ch); // keep only the first character of each run
        }
        previous = ch;
    }
    System.out.println(res); // prints "Stopd"
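
One thing to keep in mind with a plain loop like this is that it collapses every run, including legitimate double letters and repeated spaces. The sketch below wraps the same idea in a hypothetical method, `collapseRuns` (the class and method names are made up for illustration), to show that effect.

```java
class RunCollapser {
    // Keeps only the first character of every run of identical chars.
    static String collapseRuns(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        StringBuilder out = new StringBuilder(s.length());
        char previous = s.charAt(0);
        out.append(previous);
        for (int i = 1; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch != previous) {
                out.append(ch);
            }
            previous = ch;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseRuns("Stoooppppd")); // Stopd
        System.out.println(collapseRuns("comma"));      // coma
        System.out.println(collapseRuns("good  food")); // god fod
    }
}
```

If correctly spelled doubles such as "comma" must survive, a threshold of three or more repeats is the safer choice.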
    
  • 2021-01-07 15:30

    To include non-English Unicode letters as well, you may wish to use \p{L}\p{M}* instead of [a-zA-Z]: replaceAll("(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}", "$1"). Wrapping the repeated part in a second group, as in replaceAll("(\\p{L}\\p{M}*)(\\1{" + maxAllowedRepetition + ",})", "$1"), is equivalent.
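
A minimal sketch of how that pattern might be wrapped in a method follows. The names `UnicodeRepeatCollapser` and `collapse` are hypothetical, and the argument check is an added assumption. With `maxAllowedRepetition = 2`, a letter (plus any combining marks) may appear up to twice in a row; three or more occurrences are replaced by a single one.

```java
class UnicodeRepeatCollapser {
    /**
     * Collapses any letter (with its combining marks) that occurs more than
     * maxAllowedRepetition times in a row down to a single occurrence.
     */
    static String collapse(String text, int maxAllowedRepetition) {
        if (maxAllowedRepetition < 1) {
            throw new IllegalArgumentException("maxAllowedRepetition must be >= 1");
        }
        // \p{L} is any Unicode letter; \p{M}* is its trailing combining marks.
        String regex = "(\\p{L}\\p{M}*)\\1{" + maxAllowedRepetition + ",}";
        return text.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        // ASCII input behaves like the [a-zA-Z] version.
        System.out.println(collapse("stoooooop comma", 2)); // stop comma
        // "caf" followed by three e-acute letters (U+00E9): one is kept.
        System.out.println(collapse("caf\u00e9\u00e9\u00e9", 2));
        // "e" + combining acute (U+0301), three times: one pair is kept.
        System.out.println(collapse("e\u0301e\u0301e\u0301", 2));
    }
}
```

The non-ASCII inputs are written as `\u` escapes only so the example does not depend on the source file's encoding.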
