deepcopy and python - tips to avoid using it?

前端 未结 2 1881
感动是毒
感动是毒 2020-12-23 22:39

I have a very simple python routine that involves cycling through a list of roughly 20,000 latitude,longitude coordinates and calculating the distance of each point to a ref

相关标签:
2条回答
  • 2020-12-23 23:16

    I understand this doesn't directly address your question (and I know this is an old question), but since there is some discussion about performance it may be worthwhile to look at the append operation. You may want to consider "pre-allocating" the array. For example:

    array = [None] * num_elements
    for i in range(num_elements):
        array[i] = True
    

    versus:

    array = []
    for i in range(num_elements):
        array.append(True)
    

    A simple timeit run of these two methods shows a 25% improvement if you pre-allocate the array for even moderate values of num_elements.

    0 讨论(0)
  • 2020-12-23 23:28

    Okay, simplest things first:

    1. deepcopy is slow in general since it has to do a lot of internal bookkeeping to copy pathological cases like objects containing themselves in a sane way. See, for instance, this page, or take a look at the source code of deepcopy in copy.py that is somewhere in your Python path.

    2. sorted is fast, since it is implemented in C. Much faster than an equivalent sort in Python.

    Now, some more thing about Python's reference counting behaviour as you asked in your comment. In Python, variables are references. When you say a=1, think about it has having 1 as an object existing on its own, and a is just a tag attached to it. In some other languages like C, variables are containers (not tags), and when you do a=1, you actually put 1 into a. This does not hold for Python, where variables are references. This has some interesting consequences that you may have also stumbled upon:

    >>> a = []      # construct a new list, attach a tag named "a" to it
    >>> b = a       # attach a tag named "b" to the object which is tagged by "a"
    >>> a.append(1) # append 1 to the list tagged by "a"
    >>> print b     # print the list tagged by "b"
    [1]
    

    This behaviour is seen because lists are mutable objects: you can modify a list after you have created it, and the modification is seen when accessing the list through any of the variables that refer to it. The immutable equivalents of lists are tuples:

    >>> a = ()      # construct a new tuple, attach a tag named "a" to it
    >>> b = a       # now "b" refers to the same empty tuple as "a"
    >>> a += (1, 2) # appending some elements to the tuple
    >>> print b
    ()
    

    Here, a += (1, 2) creates a new tuple from the existing tuple referred to by a, plus another tuple (1, 2) that is constructed on-the-fly, and a is adjusted to point to the new tuple, while of course b still refers to the old tuple. The same happens with simple numeric additions like a = a+2: in this case, the number originally pointed to by a is not mutated in any way, Python "constructs" a new number and moves a to point to the new number. So, in a nutshell: numbers, strings and tuples are immutable; lists, dicts and sets are mutable. User-defined classes are in general mutable unless you ensure explicitly that the internal state cannot be mutated. And there's frozenset, which is an immutable set. Plus many others of course :)

    I don't know why your original code didn't work, but probably you hit a behaviour related to the code snippet I've shown with the lists as your PointDistance class is also mutable by default. An alternative could be the namedtuple class from collections, which constructs a tuple-like object whose fields can also be accessed by names. For instance, you could have done this:

    from collections import namedtuple
    PointDistance = namedtuple("PointDistance", "point distance")
    

    This creates a PointDistance class for you that has two named fields: point and distance. In your main for loop, you can assign these appropriately. Since the point objects pointed to by the point fields won't be modified during the course of your for loop, and distance is a number (which is, by definition, immutable), you should be safe doing this way. But yes, in general, it seems like simply using sorted is faster since sorted is implemented in C. You might also be lucky with the heapq module, which implements a heap data structure backed by an ordinary Python list, therefore it lets you find the top k elements easily without having to sort the others. However, since heapq is also implemented in Python, chances are that sorted works better unless you have a whole lot of points.

    Finally, I'd like to add that I never had to use deepcopy so far, so I guess there are ways to avoid it in most cases.

    0 讨论(0)
提交回复
热议问题