Remove consecutive duplicate words from a file using awk or sed

不思量自难忘° 2021-01-16 16:53

My input file looks like this:

“true true, rohith Rohith;
cold burn, and fact and fact good good?”

Output should look like:


      
      
        
借酒劲吻你 (original poster) 2021-01-16 17:51

Just match the same backreference in sed:

sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'


How it works:


:l - create a label l to jump to. See tl below.
s - substitute


/
\(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This ensures that the next part matches a whole word, not only the suffix of a longer word.
\([[:alpha:]]\{1,\}\) - match a word: one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word: one or more non-alphabetic characters.
\2 - match the same text as the second \(...\) group, i.e. the same word again.
\($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This ensures that the whole second word is matched, not only its prefix.
/
\1\2\3 - replace the match with the leading delimiter, a single copy of the word, and the trailing delimiter.
/
g - substitute globally. Matches cannot overlap, so a single pass collapses repeated words only two at a time.

tl - jump to label l if the last s command made a substitution. With three identical words in a row, such as true true true, this loop repeats the substitution until a single true remains.
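
The effect of the tl loop can be checked directly. This is a minimal sketch, assuming GNU sed (\| inside a basic regular expression is a GNU extension); the names cmd, once and looped exist only for this illustration:

```shell
# The s command from the answer, stored once so both variants share it.
cmd='s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g'

once()   { sed "$cmd"; }           # single pass, no loop
looped() { sed ":l; $cmd; tl"; }   # repeat until no substitution is made

echo 'true true true' | once       # one pair is collapsed, one repeat remains
echo 'true true true' | looped     # all repeats are collapsed
```

The single pass should leave true true behind, because the first match consumes the delimiter that the next match would need; the looped variant keeps going until only one true is left.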


Without \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\), an input such as true rue would be replaced by true, because the regex would match rue rue, where the first rue is only the suffix of true.
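
To see the difference the boundary groups make, here is a small comparison. It is a sketch assuming GNU sed; loose is a deliberately unanchored variant written for this illustration, not something from the answer, and strict wraps the command shown above:

```shell
# Unanchored variant (illustration only): no word-boundary groups.
loose()  { sed 's/\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\1/\1/g'; }

# The anchored command from the answer.
strict() { sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'; }

echo 'true rue' | loose    # suffix case: "rue rue" is wrongly collapsed
echo 'true rue' | strict   # left unchanged
echo 'the then' | loose    # prefix case: "the the" is wrongly collapsed
echo 'the then' | strict   # left unchanged
```

The loose variant should turn true rue into true and the then into then, while the strict variant should print both inputs unchanged.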

Below are my other solutions, which also remove repeated words across lines.

My first solution used uniq. First I transform the input into two-field pairs, then run them through uniq -f1, which ignores the first field, and finally convert the result back. This will be very slow:


# recreate input
cat <


But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between word and non-word tokens, so the stream is easy to read. I can skip repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:

cat <


In place of the zero byte, something unique could be used as the record separator, for example the ^ character. That way it also works with non-GNU awk versions; I tested it with the mawk available on repl. I also shortened the script by using shorter variable names:

cat <
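
A minimal sketch of this idea, written for illustration and not the snippet referred to above: sed wraps every word in ^, and awk reads the ^-separated tokens and skips a word that equals the previous one. The name dedupe_stream and the variables sep and last are hypothetical, and the sketch assumes the input itself contains no ^ character.

```shell
# Hypothetical helper; assumes the input contains no ^ characters.
dedupe_stream() {
  # Wrap every word in ^ so tokens alternate: non-word, word, non-word, ...
  sed 's/[[:alpha:]]\{1,\}/^&^/g' |
  awk 'BEGIN { RS = "^"; ORS = "" }
       NR % 2 == 1 { sep = $0; next }   # odd records are non-word tokens: hold them
       $0 == last  { sep = ""; next }   # repeated word: drop it and the separator before it
       { print sep $0; last = $0; sep = "" }
       END { print sep }                # flush the trailing non-word token
      '
}

printf '%s\n' 'true true, rohith Rohith;' 'cold burn, and fact and fact good good?' | dedupe_stream
printf '%s\n' 'good' 'good day' | dedupe_stream   # repeat across a line break
```

Because the newline is just part of a non-word token here, a word repeated at the start of the next line is dropped together with the line break before it, so the second call should print good day on one line.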


Tested on repl. The snippets output:

true, rohith Rohith;
cold burn, and fact and fact good?

    
             
                                                        
            

            
              
                