Group values by unique elements


I have a vector that looks like this:

a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440",


                      
1 Answer

情书的邮戳
2020-12-12 04:00

First of all, (I assume) this is your vector

a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")


As for possible solutions, here are a few (I can't find a good duplicate right now):

as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5


Or

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5


Or

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5


Also, rle will work similarly in your specific scenario:

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5


Or (which is practically the same)

data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5




Be advised, though, that each of the 4 solutions behaves differently in other scenarios. Consider the following vector:

a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")


And the results of the 4 different solutions:

1.

as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5


The factor solution begins with 2 because a is unsorted: factor sorts its levels by default, so "A220" becomes level 1 and the leading "B110" values get the higher integer 2. Hence, this solution numbers groups in order of appearance only if your vector is sorted, so don't use it otherwise.
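If you want to keep the factor approach on unsorted input, one possible workaround (an assumption on my part, not something the answer above proposes) is to supply the levels in order of first appearance. A minimal sketch; note that it then numbers groups the same way as the match/unique solution:

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# By default, factor() sorts its levels, so "A220" becomes level 1
levels(factor(a))
# [1] "A220" "B110" "C330" "D440" "E550"

# Assumption: passing levels in order of first appearance avoids the sort
ids <- as.integer(factor(a, levels = unique(a)))
ids
# [1] 1 1 1 2 2 3 4 4 1 1 5
```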

2.

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5


This cumsum/duplicated solution got confused because "B110" had already appeared at the beginning, so the counter did not increase when it showed up again. As a result, "D440","D440","B110","B110" were all put into the same group.
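To see why, it may help to look at the intermediate logical vector. This is just an illustrative breakdown of the same expression (the name first_seen is mine):

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# TRUE only at the first occurrence of each distinct value
first_seen <- !duplicated(a)
first_seen
# [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE

# The counter increases only at a TRUE, so the second run of "B110"
# (positions 9 and 10) inherits the id of the preceding "D440" run
ids <- cumsum(first_seen)
ids
# [1] 1 1 1 2 2 3 4 4 4 4 5
```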

3.

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5


This match/unique solution added ones at the end because unique keeps each value only once. "B110" shows up in more than one sequence, and all of its occurrences are put into the same group regardless of where they appear.
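Put differently, unique(a) acts as a lookup table and match returns each element's position in it. A small sketch of that idea (the name lookup is just for illustration):

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# Distinct values in order of first appearance
lookup <- unique(a)
lookup
# [1] "B110" "A220" "C330" "D440" "E550"

# Each id is the position of the element in the lookup table
ids <- match(a, lookup)
ids
# [1] 1 1 1 2 2 3 4 4 1 1 5

# Indexing the lookup table with the ids recovers the original vector
identical(lookup[ids], a)
# [1] TRUE
```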

4.

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6


This solution only cares about sequences (runs of consecutive equal values), hence the two separate sequences of "B110" were put into different groups.
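For reference, here is what rle sees on this vector, plus a base R one-liner that should give the same run ids without data.table. The one-liner is my own sketch, not part of the answer above, and it assumes a is non-empty and contains no NA values:

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# rle() splits the vector into runs of consecutive equal values
runs <- rle(a)
runs$values
# [1] "B110" "A220" "C330" "D440" "B110" "E550"
runs$lengths
# [1] 3 2 1 2 2 1

# Sketch: start a new id whenever a value differs from the previous one
ids <- cumsum(c(TRUE, a[-1] != a[-length(a)]))
ids
# [1] 1 1 1 2 2 3 4 4 5 5 6
```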
                                                                        
                                                        
            
            
              
                