Python delete from list while iterating

前端未结

关注

 3  869

I have a list of strings and I want to keep only the most unique strings. Here is how I have implemented this (maybe there\'s an issue with the loop),

def filter


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  北荒        
                
              
                            
                2021-01-21 16:12
              
            
            
                                                                       
The Problem with your logic is that each time when you delete an item from the array, the index gets re-arranged and skips a string in between. Eg:

Assume that this is the array:
Description : ["A","A","A","B","C"]

iterartion 1: 

i=0                      -------------0
description[i]="A"
j=i+1                    -------------1
description[j]="A"
similarity_ratio>0.6
del description[j]


Now the array is re-indexed like:
Description:["A","A","B","C"]. The next step is:

 j=j+1                   ------------1+1= 2


Description[2]="B"

You have skipped Description[1]="A"



To fix this :
Replace 

j+=1 


With

j=i+1


if deleted. Else do the normal j=j+1 iteration
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  滥情空心        
                
              
                            
                2021-01-21 16:18
              
            
            
                                                                       
If you invert your logic, you can escape having to modify the list in place and still reduce the number of comparisons needed. That is, start with an empty output/unique list and iterate over your descriptions seeing if you can add each one. So for the first description you can add it immediately as it cannot be similar to anything in an empty list. The second description only needs to be compared to the first as opposed to all other descriptions. Later iterations can short circuit as soon as they find a previous description with which they are similar to (and have the candidate description be discarded). ie.

import operator

def unique(items, compare=operator.eq):
    # compare is a function that returns True if its two arguments are deemed similar to 
    # each other and False otherwise.

    unique_items = []
    for item in items:
        if not any(compare(item, uniq) for uniq in unique_items):
            # any will stop as soon as compare(item, uniq) returns True
            # you could also use `if all(not compare(item, uniq) ...` if you prefer
            unique_items.append(item)

    return unique_items


Examples:

assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved

assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1))) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.


With your code your comparison function would look like:

MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar

def description_cmp(candidate_desc, unique_desc):
    # use unique_desc as first arg as this keeps the argument order the same as with your filter 
    # function where the first description is the one that is retained if the two descriptions 
    # are deemed to be too similar
    similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
    return similarity_ratio > MAX_SIMILAR_ALLOWED

def filter_descriptions(descriptions):
    # This would be the new definition of your filter_descriptions function
    return unique(descriptions, compare=descriptions_cmp)


The number of comparisons should be exactly the same. That is, in your implementation the first element is compared to all the others, and the second element is only compared to elements that were deemed not similar to the first element and so on. In this implementation the first item is not compared to anything initially, but all other items must be compared to it to be allowed to be added to the unique list. Only items deemed not similar to the first item will be compared to the second unique item, and so on.

The unique implementation will do less copying as it only has to copy the unique list when the backing array runs out of space. Whereas, with the del statement parts of the list must be copied each time it is used (to move all subsequent items into their new correct position). This will likely have a negligible impact on performance though, as the bottleneck is probably the ratio calculation in the sequence matcher.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别跟我提以往        
                
              
                            
                2021-01-21 16:25
              
            
            
                                                                       
The value of j should not change when an item from the list is deleted (since a different list item will be present on that spot in the next iteration). Doing j=i+1 restarts the iteration every time an item is deleted (which is not what is desired). The updated code now only increments j in the else condition. 

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复