Remove consecutive duplicate words from a file using awk or sed

前端未结

关注

 6  1368

My input file looks like below:

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

Output shoud look like:


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  余生分开走        
                
              
                            
                2021-01-16 17:32
              
            
            
                                                                       
This is not exactly what you have shown in output but is close using gnu-awk:

awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file




“true , rohith Rohith;
cold burn, and fact and fact good ?”

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2021-01-16 17:34
              
            
            
                                                                       
With GNU awk for the 4th arg to split():

$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}

$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  长情又很酷        
                
              
                            
                2021-01-16 17:34
              
            
            
                                                                       
sed -E 's/(\w+) *\1/\1/g' sample.txt


sample.txt

“true true, rohith Rohith;
cold burn, and fact and fact good good?”


output:

:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”


Explanation

(\w) *\1 - matches a word separated by a space of the same word and saves it
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  鱼传尺愫        
                
              
                            
                2021-01-16 17:46
              
            
            
                                                                       
Depending on your expected input, this might work:

sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile


([a-zA-Z0-9_-]+) = words that might be repeated.

( *)\1 = check if the previous word is repeated after a space.

s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).

s/  / /g = removes double spaces.

This works with GNU sed.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情话喂你        
                
              
                            
                2021-01-16 17:48
              
            
            
                                                                       
Simple sed:

echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  借酒劲吻你        
                
              
                            
                2021-01-16 17:51
              
            
            
                                                                       
Just match the same backreference in sed:

sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'


How it works:


:l - create a label l to jump to. See tl below.
s - substitute


/
\(^\|[^[:alpha:]]\) - match beginning of the line or non-alphabetic character. This is so that the next part matches the whole word, not only suffix.
\([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
\2 - match the same thing as in the second \(...\) - ie. match the word.
\($\|[^[:alpha:]]\) - match the end of the line or match a non-alphabetic character. That is so we match the whole second word, not only it's prefix.
/
\1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
/
g - substitute globally. But, because regex is never going back, it will substitute 2 words at a time.

tl - Jump to label l if last s command was successfull. This is here, so that when there are 3 words the same, like true true true, they are properly replaced by a single true.


Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\), without them for example true rue would be substituted by true, because the suffix rue rue would match.

Below are my other solution, which also remove repeated words across lines.

My first solution was with uniq. So first I will transform the input into pairs with the format <non-alphabetical sequence separating words encoded in hex> <a word>. Then run it via uniq -f1 with ignoring first field and then convert back. This will be very slow:

# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
    # ouptut hexadecimal representation of non-word
    printf "%s" "$1" | xxd -p | tr -d "\n"
    # and output space with the word
    printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
    # just `printf "%s" "$1" | sed 's/^-//' | xxd -r -p` for posix shell
    # change non-word from hex to characters
    printf "%s" "${1:1}" | xxd -r -p
    # output word
    printf "%s" "$2"
' --


But then I noticed that sed is doing a good job at tokenizing the input - it places zero bytes between each word and non-word tokens. So I could easily read the stream. I can ignore repeated words in awk by reading zero separated stream in GNU awk and comparing the last readed word:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
    nonword=$0
}
NR%2==0{
    if (length(lastword) && lastword != $0) {
        printf "%s%s", lastword, nonword
    }
    lastword=$0
}
END{
    printf "%s%s", lastword, nonword
}'


In place of zero byte something unique could be used as record separator, for example ^ character, that way it could be used with non-GNU awk version, tested with mawk available on repl. Shortened the script by using shorter variable names here:

cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
    NR%2{ n=$0 }
    NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
    NR%2-1 { l=$0 }
    END { printf "%s%s", l, n }
'


Tested on repl. The snippets output:

true, rohith Rohith;
cold burn, and fact and fact good?

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复