I have a python function defined as follows which i use to delete from list1 the items which are already in list2. I am using python 2.6.2 on windows XP
def comp
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.
Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?
(If you're not sure, update with your platform and one of us will tell you how to tell.)
There are 2 issues that cause your algorithm to scale poorly:
x in list
is an O(n) operation.pop(n)
where n is in the middle of the array is an O(n) operation.Both situations cause it to scale poorly O(n^2) for large amounts of data. gnud's implementation would probably be the best solution since it solves both problems without changing the order of elements or removing potential duplicates.
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.
It probably relates to the cost of poping an item out of a middle of a list.
Alternatively have you tried using sets to handle this?
def difference(list1, list2):
return [x for x in list1 if tuple(x) in set(tuple(y) for y in list2)]
You can then set list one to the resulting list if that is your intention by doing
list1 = difference(list1, list2)
The only reason why the code can become slower is that you have big elements in both lists which share a lot of common elements (so the list1[curIndex] in list2
takes more time).
Here are a couple of ways to fix this:
If you don't care about the order, convert both lists into set
s and use set1.difference(set2)
If the order in list1 is important, then at least convert list2
into a set because in
is much faster with a set
.
Lastly, try a filter: filter(list1, lambda x: x not in set2)
[EDIT] Since set()
doesn't work on recursive lists (didn't expect that), try:
result = filter(list1, lambda x: x not in list2)
It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore making the compare ever cheaper as you find elements from list2
).
Try a more pythonic approach to the filtering, something like
[x for x in list1 if x not in set(list2)]
Converting both lists to sets is unnessescary, and will be very slow and memory hungry on large amounts of data.
Since your data is a list of lists, you need to do something in order to hash it. Try out
list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
I tested out your original function, and my approach, using the following test data:
list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]
Timings - not scientific, but still:
#Original function
real 2m16.780s
user 2m16.744s
sys 0m0.017s
#My function
real 0m0.433s
user 0m0.423s
sys 0m0.007s
The often suggested set
wont work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.
You can