python function slowing down for no apparent reason

后端未结

关注

 6  1022

I have a python function defined as follows which i use to delete from list1 the items which are already in list2. I am using python 2.6.2 on windows XP

def comp


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  臣服心动        
                
              
                            
                2021-01-22 17:25
              
            
            
                                                                       
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.

Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?

(If you're not sure, update with your platform and one of us will tell you how to tell.)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不知归路        
                
              
                            
                2021-01-22 17:26
              
            
            
                                                                       
There are 2 issues that cause your algorithm to scale poorly:


x in list is an O(n) operation.
pop(n) where n is in the middle of the array is an O(n) operation.


Both situations cause it to scale poorly O(n^2) for large amounts of data.  gnud's implementation would probably be the best solution since it solves both problems without changing the order of elements or removing potential duplicates.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2021-01-22 17:38
              
            
            
                                                                       
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback.  This one is even tested.

It probably relates to the cost of poping an item out of a middle of a list.

Alternatively have you tried using sets to handle this?

def difference(list1, list2):
    return [x for x in list1 if tuple(x) in set(tuple(y) for y in list2)]


You can then set list one to the resulting list if that is your intention by doing

list1 = difference(list1, list2)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  没有蜡笔的小新        
                
              
                            
                2021-01-22 17:45
              
            
            
                                                                       
The only reason why the code can become slower is that you have big elements in both lists which share a lot of common elements (so the list1[curIndex] in list2 takes more time).

Here are a couple of ways to fix this:


If you don't care about the order, convert both lists into sets and use set1.difference(set2)
If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.
Lastly, try a filter: filter(list1, lambda x: x not in set2)


[EDIT] Since set() doesn't work on recursive lists (didn't expect that), try:

result = filter(list1, lambda x: x not in list2)


It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore making the compare ever cheaper as you find elements from list2).
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤街浪徒        
                
              
                            
                2021-01-22 17:47
              
            
            
                                                                       
Try a more pythonic approach to the filtering, something like

[x for x in list1 if x not in set(list2)]


Converting both lists to sets is unnessescary, and will be very slow and memory hungry on large amounts of data.

Since your data is a list of lists, you need to do something in order to hash it.
Try out

list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]


I tested out your original function, and my approach, using the following test data:

list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]


Timings - not scientific, but still:

 #Original function
 real    2m16.780s
 user    2m16.744s
 sys     0m0.017s

 #My function
 real    0m0.433s
 user    0m0.423s
 sys     0m0.007s

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执念已碎        
                
              
                            
                2021-01-22 17:50
              
            
            
                                                                       
The often suggested set wont work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.

You can


convert the sublists into tuples or class instances to make them hashable, then use sets.
Keep both lists sorted, then you just have to compare the lists' heads.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复