python function slowing down for no apparent reason

后端 未结 6 1022
小鲜肉
小鲜肉 2021-01-22 16:56

I have a python function defined as follows which i use to delete from list1 the items which are already in list2. I am using python 2.6.2 on windows XP

def comp         


        
相关标签:
6条回答
  • 2021-01-22 17:25

    If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.

    Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?

    (If you're not sure, update with your platform and one of us will tell you how to tell.)

    0 讨论(0)
  • 2021-01-22 17:26

    There are 2 issues that cause your algorithm to scale poorly:

    1. x in list is an O(n) operation.
    2. pop(n) where n is in the middle of the array is an O(n) operation.

    Both situations cause it to scale poorly O(n^2) for large amounts of data. gnud's implementation would probably be the best solution since it solves both problems without changing the order of elements or removing potential duplicates.

    0 讨论(0)
  • 2021-01-22 17:38

    EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.

    It probably relates to the cost of poping an item out of a middle of a list.

    Alternatively have you tried using sets to handle this?

    def difference(list1, list2):
        return [x for x in list1 if tuple(x) in set(tuple(y) for y in list2)]
    

    You can then set list one to the resulting list if that is your intention by doing

    list1 = difference(list1, list2)
    
    0 讨论(0)
  • 2021-01-22 17:45

    The only reason why the code can become slower is that you have big elements in both lists which share a lot of common elements (so the list1[curIndex] in list2 takes more time).

    Here are a couple of ways to fix this:

    • If you don't care about the order, convert both lists into sets and use set1.difference(set2)

    • If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.

    • Lastly, try a filter: filter(list1, lambda x: x not in set2)

    [EDIT] Since set() doesn't work on recursive lists (didn't expect that), try:

    result = filter(list1, lambda x: x not in list2)
    

    It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore making the compare ever cheaper as you find elements from list2).

    0 讨论(0)
  • 2021-01-22 17:47

    Try a more pythonic approach to the filtering, something like

    [x for x in list1 if x not in set(list2)]
    

    Converting both lists to sets is unnessescary, and will be very slow and memory hungry on large amounts of data.

    Since your data is a list of lists, you need to do something in order to hash it. Try out

    list2_set = set([tuple(x) for x in list2])
    diff = [x for x in list1 if tuple(x) not in list2_set]
    

    I tested out your original function, and my approach, using the following test data:

    list1 = [[x+1, x*2] for x in range(38000)]
    list2 = [[x+1, x*2] for x in range(10000, 160000)]
    

    Timings - not scientific, but still:

     #Original function
     real    2m16.780s
     user    2m16.744s
     sys     0m0.017s
    
     #My function
     real    0m0.433s
     user    0m0.423s
     sys     0m0.007s
    
    0 讨论(0)
  • 2021-01-22 17:50

    The often suggested set wont work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.

    You can

    • convert the sublists into tuples or class instances to make them hashable, then use sets.
    • Keep both lists sorted, then you just have to compare the lists' heads.
    0 讨论(0)
提交回复
热议问题