Python: yield-and-delete

逝去的感伤 2021-02-02 15:45

How do I yield an object from a generator and forget it immediately, so that it doesn't take up memory?

For example, in the following function:

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk

4 answers
  • 2021-02-02 16:13

    If you really really want to get this functionality I suppose you could use a wrapper:

    class Wrap:
        """Hold a value that can be detached exactly once via unlink()."""

        def __init__(self, val):
            self.val = val

        def unlink(self):
            val = self.val
            self.val = None  # drop the wrapper's own reference
            return val
    

    It could be used like this:

    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            chunk = Wrap(list(itertools.islice(i, int(chunksize))))
            if not chunk.val:
                break
            yield chunk.unlink()  # hand the list to the caller; the Wrap keeps nothing
    

    This is essentially the same as what phihag does with pop() ;)
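
    As a quick illustration (my own hypothetical sketch, not part of the answer; the data is made up), the Wrap-based grouper above can be consumed like any other generator, and the refcount of each received chunk stays small because the generator keeps nothing behind:

    import sys

    for chunk in grouper(range(10), 3):
        # Besides our loop variable (and getrefcount's own argument), nothing
        # else refers to the list: the generator's Wrap holds None after unlink().
        print("%r refcount=%d" % (chunk, sys.getrefcount(chunk)))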

  • 2021-02-02 16:27

    After yield chunk, the variable's value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: CPython 2.7 does not seem to do this; PyPy 1.6 with the default gc does). Therefore, you don't have to change anything but your code example, which is missing the second argument to grouper.

    Note that garbage collection is non-deterministic in Python. The null garbage collector, which doesn't collect unreachable objects at all, is a perfectly valid garbage collector. From the Python manual:

    Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.

    Therefore, it cannot be decided whether a Python program does or does not "take up memory" without specifying the Python implementation and garbage collector. Given a specific Python implementation and garbage collector, you can use the gc module to test whether the object is freed.
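
    For instance (a minimal sketch of such a test, showing CPython-specific behaviour; the Chunk subclass is my own addition, needed only because plain lists can't be weakly referenced), a weak reference to the yielded chunk shows whether the generator still keeps it alive:

    import gc
    import itertools
    import weakref

    class Chunk(list):
        """list subclass so a chunk can be the target of a weak reference."""

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            chunk = Chunk(itertools.islice(i, int(chunksize)))
            if not chunk:
                break
            yield chunk

    gen = grouper(range(100), 10)
    probe = weakref.ref(next(gen))  # weak ref only; we keep no strong reference
    gc.collect()
    # On CPython the suspended generator frame still binds `chunk`:
    print("alive after yield: %s" % (probe() is not None))
    next(gen)      # the next loop turn rebinds `chunk`, releasing the old list
    gc.collect()
    print("alive after next chunk: %s" % (probe() is not None))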

    That being said, if you really want no reference from the function (not necessarily meaning the object will be garbage-collected), here's how to do it:

    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            tmpr = [list(itertools.islice(i, int(chunksize)))]
            if not tmpr[0]:
                break
            yield tmpr.pop()  # pop() removes the generator's own reference before handing the list to the caller
    

    Instead of a list, you can also use any other data structure with a method that removes and returns an object, like Owen's wrapper.
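
    For example (my own variation, not from the answer), a single-slot dict works too, since dict.pop removes the key and returns the value in one step:

    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            holder = {'chunk': list(itertools.islice(i, int(chunksize)))}
            if not holder['chunk']:
                break
            # pop() empties the dict, so the generator keeps no reference
            # to the list it hands to the caller.
            yield holder.pop('chunk')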

  • 2021-02-02 16:30

    As defined, grouper creates wasteful duplication because you have wrapped a function of no real effect around itertools.islice. The solution is to delete the redundant code.

    I think there are concessions here to C-derived languages which are non-Pythonic and cause excess overhead. For example, you have

    i = iter(iterable)
    itertools.islice(i, int(chunksize))
    

    Why does i exist? iter will not cast a non-iterable into an iterable; there is no such cast. If given a non-iterable, both of those lines would raise an exception; the first does not guard the second.

    islice will happily act as an iterator (and may even give economies that a yield statement won't). You've got too much code: grouper probably need not exist.
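
    To illustrate that suggestion (a sketch of my own, with made-up data), the caller can slice the iterator directly instead of going through grouper:

    import itertools

    data = iter(range(11))
    chunksize = 4
    while True:
        chunk = list(itertools.islice(data, chunksize))
        if not chunk:
            break
        # process the chunk; it is released when rebound on the next turn
        print(chunk)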

  • 2021-02-02 16:35

    @ Radim,

    Several points in this thread were perplexing me. I realize that I was failing to understand the basic thing: what your problem actually was.

    Now I think I've understood it, and I'd like you to confirm.

    I'll represent your code like this:

    import itertools
    
    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            chunk = list(itertools.islice(i, int(chunksize)))
            if not chunk:
                break
            yield chunk
    
    ............
    ............
    gigi = grouper(an_iterable,4)
    # before A
    # A = grouper(an_iterable,4)
    # corrected:
    A = gigi.next()
    # after A
    ................
    ...........
    # deducing an object x from A ; x doesn't consume a lot of memory
    ............
    # deleting A because it consumes a lot of memory:
    del A
    # code carries on, taking time to execute
    ................
    ................
    ......
    ..........
    # before B
    # B = grouper(an_iterable,4)
    # corrected:
    B = gigi.next()
    # after B
    .....................
    ........
    

    Your problem is that, even during the time that elapses between
    # code carries on, taking time to execute (after the deletion of A)
    and
    # before B,
    the object whose name 'A' was deleted still exists and consumes a lot of memory, because there is still a binding between that object and the identifier 'chunk' inside the generator function?

    Excuse me for asking you about a point that now seems evident to me.
    However, as there was a certain confusion in the thread at one time, I'd like you to confirm that I have now correctly understood your problem.
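
    A small experiment along these lines (my own sketch with made-up data; gi_frame is CPython-specific introspection) shows the binding that keeps the yielded object alive inside the suspended generator even after del A:

    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            chunk = list(itertools.islice(i, int(chunksize)))
            if not chunk:
                break
            yield chunk

    gigi = grouper(range(20), 4)
    A = next(gigi)
    chunk_id = id(A)
    del A  # our reference is gone, but is the object gone?

    # The suspended generator frame still binds the very same object to 'chunk'.
    frame_chunk = gigi.gi_frame.f_locals['chunk']
    print(id(frame_chunk) == chunk_id)  # True on CPython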

    .

    @ phihag

    You wrote in a comment:

    1)
    After the yield chunk, there is no way to access the value stored in chunk from this function. Therefore, this function does not hold any references to the object in question

    (By the way, I wouldn't have written 'therefore', but 'because'.)

    I think that affirmation #1 is debatable. In fact, I'm convinced it is false. But there is a subtlety in what you claim, not in this quotation alone, but globally, if we also take into account what you say at the beginning of your answer.

    Let us take things in order.

    The following code seems to prove the contrary of your affirmation "After the yield chunk, there is no way to access the value stored in chunk from this function."

    import itertools
    
    def grouper(iterable, chunksize):
        i = iter(iterable)
        chunk = ''
        last = ''
        while True:
            print 'new turn   ',id(chunk)
            if chunk:
                last = chunk[-1]
            chunk = list(itertools.islice(i, int(chunksize)))
            print 'new chunk  ',id(chunk),'  len of chunk :',len(chunk)
            if not chunk:
                break
            yield '%s  -  %s' % (last,' , '.join(chunk))
            print 'end of turn',id(chunk),'\n'
    
    
    for x in grouper(['1','2','3','4','5','6','7','8','9','10','11'],'4'):
        print repr(x)
    

    result

    new turn    10699768
    new chunk   18747064   len of chunk : 4
    '  -  1 , 2 , 3 , 4'
    end of turn 18747064 
    
    new turn    18747064
    new chunk   18777312   len of chunk : 4
    '4  -  5 , 6 , 7 , 8'
    end of turn 18777312 
    
    new turn    18777312
    new chunk   18776952   len of chunk : 3
    '8  -  9 , 10 , 11'
    end of turn 18776952 
    
    new turn    18776952
    new chunk   18777512   len of chunk : 0
    

    .

    However, you also wrote (it's the beginning of your answer):

    2)
    After yield chunk, the variable's value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: CPython 2.7 does not seem to do this; PyPy 1.6 with the default gc does).

    This time you don't say that the function holds no reference to chunk after yield chunk; you say that its value is not used again before chunk is renewed in the next turn of the while loop. That's right: in Radim's code, the object chunk isn't used again before the identifier 'chunk' is re-assigned by the instruction chunk = list(itertools.islice(i, int(chunksize))) in the next turn of the loop.

    .

    Affirmation #2 in this quotation, different from the preceding affirmation #1, has two logical consequences:

    FIRST, my code above can't claim to prove strictly, to someone who thinks as you do, that there is indeed a way to access the value of chunk after the yield chunk instruction,
    because the conditions in my code above are not the same as those under which you affirm the contrary; that is to say, in Radim's code, about which you speak, the object chunk is really not used again before the next turn.
    One can then argue that it's because of the uses of chunk in my code above (the instructions print 'end of turn',id(chunk),'\n' , print 'new turn ',id(chunk) and last = chunk[-1] do use it) that a reference to the object chunk is still held after the yield chunk.

    SECONDLY, going further in the reasoning, putting your two quotations together leads to the conclusion that, in your view, it's because chunk is no longer used after the yield chunk instruction in Radim's code that no reference to it is maintained.
    It's a matter of logic, IMO: the absence of references to an object is the condition for its freeing; hence, if you claim that the memory is freed of the object because the object is no longer used, that is equivalent to claiming that the memory is freed of the object because its disuse makes the interpreter delete the reference to it inside the function.

    To sum up:
    you claim that in Radim's code chunk is no longer used after yield chunk, hence no reference to it is held any more, and then..... CPython 2.7 won't act on this... but PyPy 1.6 with the default gc frees the memory of the object chunk.

    At this point, I'm very surprised by the reasoning behind this consequence: it would be because chunk is no longer used that PyPy 1.6 frees it. You don't express this reasoning explicitly, but without it I would find what you claim in the two quotations illogical and incomprehensible.

    What perplexes me in this conclusion, and the reason I don't agree with it, is that it implies that PyPy 1.6 would be able to analyze the code and detect that chunk won't be used again after yield chunk. I find this idea completely unbelievable, and I would like you:

    • to explain what exactly you think about all that. Where am I wrong in my understanding of your ideas?

    • to say whether you have proof that, at least in PyPy 1.6, no reference to chunk is held once it is no longer used.
      The problem with Radim's initial code was that too much memory was consumed by the persistence of the object chunk, because a reference to it was still held inside the generator function: that was an indirect symptom of the existence of such a persistent internal reference.
      Have you observed a similar behavior with PyPy 1.6? I don't see another way to put the remaining reference inside the generator in evidence, since, according to your quotation #2, any use of chunk after yield chunk is enough to trigger the retention of a reference to it. It's a problem similar to one in quantum mechanics: measuring the speed of a particle modifies its speed.....
