Remove duplicates in string


I have the following data set:

df <- data.frame(
    path = c("a,b,a", 
        "(direct) / (none),   (direct) / (none), google / cpc,    google / cpc"


                      
3 answers

-上瘾入骨i, 2021-01-14 06:45

Basic logic behind the code below:

(i) split each row on ","; (ii) remove the surrounding whitespace; (iii) take the unique values; (iv) collapse back into a single string with ",".

# For each row: split on ",", trim whitespace, drop duplicates, re-join with ","
t <- apply(df, 1, function(x) paste0(unique(trimws(unlist(strsplit(x, ",")))), collapse = ","))
df <- data.frame(t)
# df
#                               t
#1                            a,b
#2 (direct) / (none),google / cpc
#3                            f,d
#4                            a,c
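To make the four steps easier to follow, here is a minimal sketch that runs them one at a time on a single string. The variable names (x, parts, trimmed, uniq, result) are only illustrative and are not part of the original answer.

```r
# One input string, taken from the second row of the question's data
x <- "(direct) / (none),   (direct) / (none), google / cpc,    google / cpc"

# (i) split on "," ; strsplit() returns a list, so flatten it
parts <- unlist(strsplit(x, ","))

# (ii) strip leading and trailing whitespace from each piece
trimmed <- trimws(parts)

# (iii) keep the first occurrence of each value, in order
uniq <- unique(trimmed)

# (iv) glue the pieces back together with ","
result <- paste0(uniq, collapse = ",")

print(result)
```

Trimming has to happen before unique(); otherwise " google / cpc" and "    google / cpc" would count as different values.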

                                                                        
                                                        
            
            
              
                
慢半拍i, 2021-01-14 06:51

You were almost there. The only change needed is to split on ",\\s*" instead of just ",". With a plain "," split, calling unique() won't produce the wanted output, because otherwise identical strings can differ in the number of leading spaces. If you consume those spaces as part of the separator when you split, the issue goes away.
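The difference between the two split patterns can be seen on a small made-up string (x and x2 below are illustrative inputs, not data from the question). The last lines also show a caveat worth keeping in mind: ",\\s*" only removes spaces that follow a comma, so whitespace at the very start of the string survives unless trimws() is applied as well.

```r
x <- "a,  b, a,b"

plain <- strsplit(x, ",")[[1]]      # pieces keep their leading spaces
regex <- strsplit(x, ",\\s*")[[1]]  # spaces after each comma are consumed

print(unique(plain))  # four distinct values, nothing is deduplicated
print(unique(regex))  # "a" "b"

# Caveat: whitespace before the first element is not next to a comma
x2 <- "  a, b,a"
pieces <- strsplit(x2, ",\\s*")[[1]]
print(unique(pieces))          # "  a" and "a" are still treated as different
print(unique(trimws(pieces)))  # "a" "b"
```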

On another note, since you used setDT(df), I guess you are using data.table. If so, use data.table's := syntax so the column is updated in place rather than copied:

df[, path := sapply(
    strsplit(as.character(path), split = ",\\s*"),
    function(x) paste(unique(x), collapse = ", "))]


This modifies the path column by reference.
                                                                        
                                                        
            
            
              
                
我寻月下人不归, 2021-01-14 06:52

It looks like your problem is the leading whitespace in the elements of the second string. Are you trying to preserve it, or are you willing to lose it? If you're willing to lose it, then

df$path <- sapply(strsplit(as.character(df$path), split=","), function(x) {
    paste(unique(trimws(x)), collapse = ', ') } )


is what you want:

> df$path <- sapply(strsplit(as.character(df$path), split=","), function(x) {
+     paste(unique(trimws(x)), collapse = ', ') } )
> df$path
[1] "a, b"                            "(direct) / (none), google / cpc"
[3] "f, d"                            "a, c"
>
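If the same cleanup is needed on several columns, the approach above could be wrapped in a small helper. This is only a sketch under a few assumptions: the name dedupe_tokens and its arguments are hypothetical, NA inputs are assumed to stay NA, and the separator is treated as a literal string rather than a regular expression.

```r
# Hypothetical helper: deduplicate separator-delimited tokens in each string
dedupe_tokens <- function(x, sep = ",", collapse = ", ") {
  x <- as.character(x)  # also handles factor columns
  vapply(strsplit(x, sep, fixed = TRUE), function(tokens) {
    # strsplit() turns an NA input into a single NA token; keep it as NA
    if (length(tokens) == 1L && is.na(tokens)) return(NA_character_)
    paste(unique(trimws(tokens)), collapse = collapse)
  }, character(1), USE.NAMES = FALSE)
}

paths <- c("a,b,a",
           "(direct) / (none),   (direct) / (none), google / cpc,    google / cpc",
           NA)
cleaned <- dedupe_tokens(paths)
print(cleaned)
```

vapply() is used here instead of sapply() so that the result is always a character vector, even for zero-length input.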

                                                                        
                                                        
            
            
              
                