I have a very simple python routine that involves cycling through a list of roughly 20,000 latitude,longitude coordinates and calculating the distance of each point to a ref
I understand this doesn't directly address your question (and I know this is an old question), but since there is some discussion about performance it may be worthwhile to look at the append
operation. You may want to consider "pre-allocating" the array. For example:
array = [None] * num_elements
for i in range(num_elements):
array[i] = True
versus:
array = []
for i in range(num_elements):
array.append(True)
A simple timeit
run of these two methods shows a 25% improvement if you pre-allocate the array for even moderate values of num_elements
.
Okay, simplest things first:
deepcopy
is slow in general since it has to do a lot of internal bookkeeping to copy pathological cases like objects containing themselves in a sane way. See, for instance, this page, or take a look at the source code of deepcopy
in copy.py
that is somewhere in your Python path.
sorted
is fast, since it is implemented in C. Much faster than an equivalent sort in Python.
Now, some more thing about Python's reference counting behaviour as you asked in your comment. In Python, variables are references. When you say a=1
, think about it has having 1
as an object existing on its own, and a
is just a tag attached to it. In some other languages like C, variables are containers (not tags), and when you do a=1
, you actually put 1 into a
. This does not hold for Python, where variables are references. This has some interesting consequences that you may have also stumbled upon:
>>> a = [] # construct a new list, attach a tag named "a" to it
>>> b = a # attach a tag named "b" to the object which is tagged by "a"
>>> a.append(1) # append 1 to the list tagged by "a"
>>> print b # print the list tagged by "b"
[1]
This behaviour is seen because lists are mutable objects: you can modify a list after you have created it, and the modification is seen when accessing the list through any of the variables that refer to it. The immutable equivalents of lists are tuples:
>>> a = () # construct a new tuple, attach a tag named "a" to it
>>> b = a # now "b" refers to the same empty tuple as "a"
>>> a += (1, 2) # appending some elements to the tuple
>>> print b
()
Here, a += (1, 2)
creates a new tuple from the existing tuple referred to by a
, plus another tuple (1, 2)
that is constructed on-the-fly, and a
is adjusted to point to the new tuple, while of course b
still refers to the old tuple. The same happens with simple numeric additions like a = a+2
: in this case, the number originally pointed to by a
is not mutated in any way, Python "constructs" a new number and moves a
to point to the new number. So, in a nutshell: numbers, strings and tuples are immutable; lists, dicts and sets are mutable. User-defined classes are in general mutable unless you ensure explicitly that the internal state cannot be mutated. And there's frozenset
, which is an immutable set. Plus many others of course :)
I don't know why your original code didn't work, but probably you hit a behaviour related to the code snippet I've shown with the lists as your PointDistance
class is also mutable by default. An alternative could be the namedtuple
class from collections
, which constructs a tuple-like object whose fields can also be accessed by names. For instance, you could have done this:
from collections import namedtuple
PointDistance = namedtuple("PointDistance", "point distance")
This creates a PointDistance
class for you that has two named fields: point
and distance
. In your main for
loop, you can assign these appropriately. Since the point objects pointed to by the point
fields won't be modified during the course of your for
loop, and distance
is a number (which is, by definition, immutable), you should be safe doing this way. But yes, in general, it seems like simply using sorted
is faster since sorted
is implemented in C. You might also be lucky with the heapq
module, which implements a heap data structure backed by an ordinary Python list, therefore it lets you find the top k
elements easily without having to sort the others. However, since heapq
is also implemented in Python, chances are that sorted
works better unless you have a whole lot of points.
Finally, I'd like to add that I never had to use deepcopy
so far, so I guess there are ways to avoid it in most cases.