Question:
I've profiled my Python program to death, and there is one function that is slowing everything down. It uses Python dictionaries heavily, so ...
This would require a fair amount of work, but... you might consider using Floyd-Warshall running on a GPU. There has been a lot of work done on making Floyd-Warshall run very efficiently on a GPU. A quick Google search yields:
http://cvit.iiit.ac.in/papers/Pawan07accelerating.pdf
http://my.safaribooksonline.com/book/programming/graphics/9780321545411/gpu-computing-for-protein-structure-prediction/ch43lev1sec2#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODAzMjE1NDU0MTEvNDg3
http://www.gpucomputing.net/?q=node/1203
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter43.html
Even though, as implemented in Python, Floyd-Warshall was slower by an order of magnitude, a good GPU version on a powerful GPU might still significantly outperform your new Python code.
Here's an anecdote. I had a short, simple, compute-intensive piece of code that did something similar to a Hough accumulation. In Python, optimized as far as I could get it, it took ~7s on a speedy i7. I then wrote a completely non-optimized GPU version; it took ~0.002s on an Nvidia GTX 480. YMMV, but for anything significantly parallel, the GPU is likely to be a long-term winner, and since it's a well-studied algorithm, you should be able to utilize existing highly-tuned code.
For the Python / GPU bridge, I'd recommend PyCUDA or PyOpenCL.
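To give a feel for the bridge, here is a minimal PyCUDA sketch (not Floyd-Warshall itself - the kernel and sizes are just illustrative): it compiles a tiny CUDA kernel from Python and launches it on NumPy arrays.
import numpy as np
import pycuda.autoinit                  # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add(float *dest, float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dest[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

n = 1 << 20
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

# drv.In / drv.Out handle the host <-> device copies
add(drv.Out(dest), drv.In(a), drv.In(b), np.int32(n),
    block=(256, 1, 1), grid=((n + 255) // 256, 1))
assert np.allclose(dest, a + b)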
node_after_b == node_a will try to call node_after_b.__eq__(node_a):
>>> class B(object):
...     def __eq__(self, other):
...         print "B.__eq__()"
...         return False
...
>>> class A(object):
...     def __eq__(self, other):
...         print "A.__eq__()"
...         return False
...
>>> a = A()
>>> b = B()
>>> a == b
A.__eq__()
False
>>> b == a
B.__eq__()
False
>>>
Try to override Node.__eq__() with an optimized version before resorting to C.
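A minimal sketch of what that could look like - the question doesn't show Node's fields, so the id attribute here is an assumption:
class Node(object):
    __slots__ = ('id',)                 # also saves memory per instance

    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        # identity short-circuit first, then the cheapest distinguishing field
        return self is other or self.id == other.id

    def __ne__(self, other):            # keep != consistent in Python 2
        return not self.__eq__(other)

    def __hash__(self):                 # needed if nodes are used as dict keys
        return hash(self.id)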
UPDATE
I made this little experiment (Python 2.6.6):
#!/usr/bin/env python
# test.py
class A(object):
    def __init__(self, id):
        self.id = id

class B(A):
    def __eq__(self, other):
        return self.id == other.id
@profile
def main():
    list_a = []
    list_b = []
    for x in range(100000):
        list_a.append(A(x))
        list_b.append(B(x))

    ob_a = A(1)
    ob_b = B(1)
    for ob in list_a:
        if ob == ob_a:
            x = True
        if ob is ob_a:
            x = True
        if ob.id == ob_a.id:
            x = True
        if ob.id == 1:
            x = True
    for ob in list_b:
        if ob == ob_b:
            x = True
        if ob is ob_b:
            x = True
        if ob.id == ob_b.id:
            x = True
        if ob.id == 1:
            x = True

if __name__ == '__main__':
    main()
(The @profile decorator is injected by line_profiler's kernprof, e.g. kernprof.py -l -v test.py; it is not defined in a plain Python run.)
Results:
Timer unit: 1e-06 s
File: test.py
Function: main at line 10
Total time: 5.52964 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
10 @profile
11 def main():
12 1 5 5.0 0.0 list_a = []
13 1 3 3.0 0.0 list_b = []
14 100001 360677 3.6 6.5 for x in range(100000):
15 100000 763593 7.6 13.8 list_a.append(A(x))
16 100000 924822 9.2 16.7 list_b.append(B(x))
17
18 1 14 14.0 0.0 ob_a = A(1)
19 1 5 5.0 0.0 ob_b = B(1)
20 100001 500454 5.0 9.1 for ob in list_a:
21 100000 267252 2.7 4.8 if ob == ob_a:
22 x = True
23 100000 259075 2.6 4.7 if ob is ob_a:
24 x = True
25 100000 539683 5.4 9.8 if ob.id == ob_a.id:
26 1 3 3.0 0.0 x = True
27 100000 271519 2.7 4.9 if ob.id == 1:
28 1 3 3.0 0.0 x = True
29 100001 296736 3.0 5.4 for ob in list_b:
30 100000 472204 4.7 8.5 if ob == ob_b:
31 1 4 4.0 0.0 x = True
32 100000 283165 2.8 5.1 if ob is ob_b:
33 x = True
34 100000 298839 3.0 5.4 if ob.id == ob_b.id:
35 1 3 3.0 0.0 x = True
36 100000 291576 2.9 5.3 if ob.id == 1:
37 1 3 3.0 0.0 x = True
I was very surprised:
- for the plain A objects, == (line 21) costs about the same as is (line 23);
- the Python-level __eq__ on B (line 30) nearly doubles the cost of the comparison;
- the double attribute access in ob.id == ob_a.id (line 25) is the most expensive of these tests.
Then I tried with more complex objects and the results were consistent with this first experiment.
Are you swapping a lot? If your dataset is so large that it does not fit in available RAM, I guess you may be experiencing some kind of I/O contention from virtual-memory paging.
Are you running Linux? If so, could you post a vmstat of your machine while running your program? Send us the output of something like:
vmstat 10 100
Good luck!
UPDATE (from comments by OP)
I suggested playing with sys.setcheckinterval and enabling/disabling the GC. The rationale is that for this particular case (a huge number of instances) the default GC check is relatively expensive and its default interval fires far too often.
Yes, I had previously played with sys.setcheckinterval. I changed it to 1000 (from its default of 100), but it didn't make any measurable difference. Disabling garbage collection has helped - thanks. This has been the biggest speedup so far - saving about 20% (171 minutes for the whole run, down to 135 minutes) - I'm not sure what the error bars are on that, but it must be a statistically significant improvement. – Adam Nellis Feb 9 at 15:10
My guess:
I think the Python GC is based on reference counting. From time to time it will check the reference count for every instance; since you are traversing these huge in-memory structures, in your particular case the GC's default frequency (1000 cycles?) is far too often - a huge waste. – Yours Truly Feb 10 at 2:06
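A sketch of those two knobs together (run_propagation is a stand-in for the OP's hot loop; the interval value is illustrative):
import gc
import sys

sys.setcheckinterval(10000)   # default is 100; do the periodic checks less often
gc.disable()                  # reference counting still frees objects;
                              # only the cyclic collector is switched off
try:
    run_propagation()         # hypothetical: the dictionary-heavy workload
finally:
    gc.enable()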
I don't see anything wrong with your code regarding performance (without trying to grok the algorithm); you are just getting hit by the huge number of iterations. Parts of your code get executed 40 million times!
Notice how 80% of the time is spent in 20% of your code - and those are the 13 lines that get executed 24+ million times. By the way, with this code you provide a great illustration of the Pareto principle (or "20% of beer drinkers drink 80% of the beer").
First things first: have you tried Psyco? It's a JIT compiler that can greatly speed up your code - considering the big number of iterations, say by a factor of 4x-5x - and all you have to do (after downloading and installing it, of course) is insert this snippet at the beginning:
import psyco
psyco.full()
This is why I liked Psyco and used it in GCJ too, where time is of the essence - nothing to code, nothing to get wrong, and a sudden boost from two added lines.
Back to nit-picking (which is all that changes like replacing == with is amount to, given the small % time improvement). Here are the 13 lines "at fault":
Line # Hits Time Per Hit % Time Line Contents
412 42350234 197075504439 4653.5 8.1 for node_c, (distance_b_c, node_after_b) in node_b_distances.items(): # Can't use iteritems() here, as deleting from the dictionary
386 42076729 184216680432 4378.1 7.6 for node_c, (distance_a_c, node_after_a) in self.node_distances[node_a].iteritems():
362 41756882 183992040153 4406.3 7.6 for node_c, (distance_b_c, node_after_b) in node_b_distances.iteritems(): # Think it's ok to modify items while iterating over them (just not insert/delete) (seems to work ok)
413 41838114 180297579789 4309.4 7.4 if(distance_b_c > cutoff_distance):
363 41244762 172425596985 4180.5 7.1 if(node_after_b == node_a):
389 41564609 172040284089 4139.1 7.1 if(node_c == node_b): # a-b path
388 41564609 171150289218 4117.7 7.1 node_b_update = False
391 41052489 169406668962 4126.6 7 elif(node_after_a == node_b): # a-b-a-b path
405 41564609 164585250189 3959.7 6.8 if node_b_update:
394 24004846 103404357180 4307.6 4.3 (distance_b_c, node_after_b) = node_b_distances[node_c]
395 24004846 102717271836 4279 4.2 if(node_after_b != node_a): # b doesn't already go to a first
393 24801082 101577038778 4095.7 4.2 elif(node_c in node_b_distances): # b can already get to c
A) Besides the lines you mention, I notice that #388 takes a relatively long time even though it is trivial - all it does is node_b_update = False. Ah, but wait - each time it gets executed, False gets looked up in the global scope! To avoid that, assign F, T = False, True at the beginning of the method and replace later uses of False and True with the locals F and T. This should decrease overall time, although only by a little (3%?).
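A tiny illustration of the trick (the function and names here are made up; this matters on Python 2.x, where True and False are looked up as names on every use):
def count_matches(pairs):
    F, T = False, True            # bind once as cheap local names
    found = F
    for left, right in pairs:
        if left is right:
            found = T             # local lookup instead of a global one
    return found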
B) I notice that the condition in #389 occurred "only" 512,120 times (based on the number of executions of #390) vs the condition in #391 with 16,251,407. Since there is no dependency between them, it makes sense to reverse the order of those checks - the early "cut" should give a small boost (2%?). I am not sure whether avoiding pass statements altogether will help, but if it does not hurt readability:
if (node_after_a is not node_b) and (node_c is not node_b):
    # neither a-b-a-b nor a-b path
    if (node_c in node_b_distances): # b can already get to c
        (distance_b_c, node_after_b) = node_b_distances[node_c]
        if (node_after_b is not node_a): # b doesn't already go to a first
            distance_b_a_c = neighbour_distance_b_a + distance_a_c
            if (distance_b_a_c < distance_b_c): # quicker to go via a
                node_b_update = T
    else: # b can't already get to c
        distance_b_a_c = neighbour_distance_b_a + distance_a_c
        if (distance_b_a_c < cutoff_distance): # not too far to go
            node_b_update = T
C) I just noticed you are using try-except in a case (#365-367) where you just need the default value from a dictionary - try using .get(key, defaultVal) instead, or create your dictionaries with collections.defaultdict(itertools.repeat(float('+inf')).next). Using try-except has its price - see #365: it reports 3.5% of the time, and that's just setting up stack frames and whatnot.
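For illustration, here are both options in miniature (the dictionary contents are made up; the OP's values are (distance, next_node) tuples):
import collections
import itertools

INF = float('+inf')
node_b_distances = {'c1': (3.5, 'a')}

# 1) dict.get with an explicit default - no exception machinery involved
distance_b_c, node_after_b = node_b_distances.get('c2', (INF, None))

# 2) a defaultdict whose factory keeps returning +inf (Python 2: .next)
distances = collections.defaultdict(itertools.repeat(INF).next)
print distances['missing']        # -> inf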
D) Avoid indexed access (be it with obj.field or obj[idx]) when possible. For example, I see you use self.node_distances[node_a] in multiple places (#336, 339, 346, 366, 386), which means every use involves two lookups (one for . and one for []) - and that gets expensive when executed tens of millions of times. It seems to me you could just do node_a_distances = self.node_distances[node_a] at the beginning of the method and then use that from there on.
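A small sketch of that hoisting (the class layout is simplified and the method name is made up, not the OP's real code):
class Chemical(object):
    def __init__(self, node_distances):
        self.node_distances = node_distances

    def sum_distances(self, node_a):
        # one attribute lookup + one subscript, reused by the loop below
        node_a_distances = self.node_distances[node_a]
        total = 0.0
        for node_c, (distance_a_c, node_after_a) in node_a_distances.iteritems():
            total += distance_a_c
        return total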
I would have posted this as an update to my question, but Stack Overflow only allows 30000 characters in questions, so I'm posting this as an answer.
Update: My best optimisations so far
I've taken on board people's suggestions, and now my code runs about 21% faster than before, which is good - thanks everyone!
This is the best I've managed to do so far. I've replaced all the == tests with is for nodes, disabled garbage collection and re-written the big if statement part at line 388, in line with @Nas Banov's suggestions. I added in the well-known try/except trick for avoiding tests (line 390 - to remove the test node_c in node_b_distances), which helped loads, since it hardly ever throws the exception. I tried switching lines 391 and 392 around, and assigning node_b_distances[node_c] to a variable, but this way was the quickest.
However, I still haven't tracked down the memory leak (see graph in my question). I think this might be in a different part of my code (that I haven't posted here). If I can fix the memory leak, then this program will run quickly enough for me to use :)
Timer unit: 3.33366e-10 s
File: routing_distances.py
Function: propagate_distances_node at line 328
Total time: 760.74 s
Line # Hits Time Per Hit % Time Line Contents
328 @profile
329 def propagate_distances_node(self, node_a, cutoff_distance=200):
330
331 # a makes sure its immediate neighbours are correctly in its distance table
332 # because its immediate neighbours may change as binds/folding change
333 791349 4158169713 5254.5 0.2 for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
334 550522 2331886050 4235.8 0.1 use_neighbour_link = False
335
336 550522 2935995237 5333.1 0.1 if(node_b not in self.node_distances[node_a]): # a doesn't know distance to b
337 15931 68829156 4320.5 0.0 use_neighbour_link = True
338 else: # a does know distance to b
339 534591 2728134153 5103.2 0.1 (node_distance_b_a, next_node) = self.node_distances[node_a][node_b]
340 534591 2376374859 4445.2 0.1 if(node_distance_b_a > neighbour_distance_b_a): # neighbour distance is shorter
341 78 347355 4453.3 0.0 use_neighbour_link = True
342 534513 3145889079 5885.5 0.1 elif((None is next_node) and (float('+inf') == neighbour_distance_b_a)): # direct route that has just broken
343 74 327600 4427.0 0.0 use_neighbour_link = True
344
345 550522 2414669022 4386.1 0.1 if(use_neighbour_link):
346 16083 81850626 5089.3 0.0 self.node_distances[node_a][node_b] = (neighbour_distance_b_a, None)
347 16083 87064200 5413.4 0.0 self.nodes_changed.add(node_a)
348
349 ## Affinity distances update
350 16083 86580603 5383.4 0.0 if((node_a.type == Atom.BINDING_SITE) and (node_b.type == Atom.BINDING_SITE)):
351 234 6656868 28448.2 0.0 self.add_affinityDistance(node_a, node_b, self.chemistry.affinity(node_a.data, node_b.data))
352
353 # a sends its table to all its immediate neighbours
354 791349 4034651958 5098.4 0.2 for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
355 550522 2392248546 4345.4 0.1 node_b_changed = False
356
357 # b integrates a's distance table with its own
358 550522 2520330696 4578.1 0.1 node_b_chemical = node_b.chemical
359 550522 2734341975 4966.8 0.1 node_b_distances = node_b_chemical.node_distances[node_b]
360
361 # For all b's routes (to c) that go to a first, update their distances
362 46679347 222161837193 4759.3 9.7 for node_c, (distance_b_c, node_after_b) in node_b_distances.iteritems(): # Think it's ok to modify items while iterating over them (just not insert/delete) (seems to work ok)
363 46128825 211963639122 4595.0 9.3 if(node_after_b is node_a):
364
365 18677439 79225517916 4241.8 3.5 try:
366 18677439 101527287264 5435.8 4.4 distance_b_a_c = neighbour_distance_b_a + self.node_distances[node_a][node_c][0]
367 181510 985441680 5429.1 0.0 except KeyError:
368 181510 1166118921 6424.5 0.1 distance_b_a_c = float('+inf')
369
370 18677439 89626381965 4798.6 3.9 if(distance_b_c != distance_b_a_c): # a's distance to c has changed
371 692131 3352970709 4844.4 0.1 node_b_distances[node_c] = (distance_b_a_c, node_a)
372 692131 3066946866 4431.2 0.1 node_b_changed = True
373
374 ## Affinity distances update
375 692131 3808548270 5502.6 0.2 if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
376 96794 1655818011 17106.6 0.1 node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
377
378 # If distance got longer, then ask b's neighbours to update
379 ## TODO: document this!
380 18677439 88838493705 4756.5 3.9 if(distance_b_a_c > distance_b_c):
381 #for (node, neighbour_distance) in node_b_chemical.neighbours[node_b].iteritems():
382 1656796 7949850642 4798.3 0.3 for node in node_b_chemical.neighbours[node_b]:
383 1172486 6307264854 5379.4 0.3 node.chemical.nodes_changed.add(node)
384
385 # Look for routes from a to c that are quicker than ones b knows already
386 46999631 227198060532 4834.0 10.0 for node_c, (distance_a_c, node_after_a) in self.node_distances[node_a].iteritems():
387
388 46449109 218024862372 4693.8 9.6 if((node_after_a is not node_b) and # not a-b-a-b path
389 28049321 126269403795 4501.7 5.5 (node_c is not node_b)): # not a-b path
390 27768341 121588366824 4378.7 5.3 try: # Assume node_c in node_b_distances ('try' block will raise KeyError if not)
391 27768341 159413637753 5740.8 7.0 if((node_b_distances[node_c][1] is not node_a) and # b doesn't already go to a first
392 8462467 51890478453 6131.8 2.3 ((neighbour_distance_b_a + distance_a_c) < node_b_distances[node_c][0])):
393
394 # Found a route
395 224593 1168129548 5201.1 0.1 node_b_distances[node_c] = (neighbour_distance_b_a + distance_a_c, node_a)
396 ## Affinity distances update
397 224593 1274631354 5675.3 0.1 if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
398 32108 551523249 17177.1 0.0 node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
399 224593 1165878108 5191.1 0.1 node_b_changed = True
400
401 809945 4449080808 5493.1 0.2 except KeyError:
402 # b can't already get to c (node_c not in node_b_distances)
403 809945 4208032422 5195.5 0.2 if((neighbour_distance_b_a + distance_a_c) < cutoff_distance): # not too for to go
404
405 # These lines of code copied, for efficiency
406 # (most of the time, the 'try' block succeeds, so don't bother testing for (node_c in node_b_distances))
407 # Found a route
408 587726 3162939543 5381.7 0.1 node_b_distances[node_c] = (neighbour_distance_b_a + distance_a_c, node_a)
409 ## Affinity distances update
410 587726 3363869061 5723.5 0.1 if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
411 71659 1258910784 17568.1 0.1 node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
412 587726 2706161481 4604.5 0.1 node_b_changed = True
413
414
415
416 # If any of node b's rows have exceeded the cutoff distance, then remove them
417 47267073 239847142446 5074.3 10.5 for node_c, (distance_b_c, node_after_b) in node_b_distances.items(): # Can't use iteritems() here, as deleting from the dictionary
418 46716551 242694352980 5195.0 10.6 if(distance_b_c > cutoff_distance):
419 200755 967443975 4819.0 0.0 del node_b_distances[node_c]
420 200755 930470616 4634.9 0.0 node_b_changed = True
421
422 ## Affinity distances update
423 200755 4717125063 23496.9 0.2 node_b_chemical.del_affinityDistance(node_b, node_c)
424
425 # If we've modified node_b's distance table, tell its chemical to update accordingly
426 550522 2684634615 4876.5 0.1 if(node_b_changed):
427 235034 1383213780 5885.2 0.1 node_b_chemical.nodes_changed.add(node_b)
428
429 # Remove any neighbours that have infinite distance (have just unbound)
430 ## TODO: not sure what difference it makes to do this here rather than above (after updating self.node_distances for neighbours)
431 ## but doing it above seems to break the walker's movement
432 791349 4367879451 5519.5 0.2 for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].items(): # Can't use iteritems() here, as deleting from the dictionary
433 550522 2968919613 5392.9 0.1 if(neighbour_distance_b_a > cutoff_distance):
434 148 775638 5240.8 0.0 del self.neighbours[node_a][node_b]
435
436 ## Affinity distances update
437 148 2096343 14164.5 0.0 self.del_affinityDistance(node_a, node_b)
Have you considered Pyrex / Cython?
It compiles Python to C and then to a compiled extension module (.pyd/.so) automatically, so it might speed things up a fair bit without much work.
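A minimal sketch of that route (file names here are made up): move the hot function into a .pyx file - it can start as a straight copy of the existing Python - and build it with a small setup script; adding cdef types to the innermost distance variables is usually where most of the gain comes from.
# setup.py
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("routing_distances_fast.pyx"))
Build with python setup.py build_ext --inplace, then import routing_distances_fast as a normal module.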