list around groupby results in empty groups

别那么骄傲 2020-12-10 10:04

I was playing around to get a better feeling for itertools groupby, so I grouped a list of tuples by the number and tried to get a list of the resulting groups.

3 Answers
  • 2020-12-10 10:17

    Summary: The reason is that the itertools functions generally do not store data. They just consume an iterator. So when the outer iterator advances, the inner iterator must advance as well.

    Analogy: Imagine you are a flight attendant standing at the door, admitting a single line of passengers to an aircraft. The passengers are arranged by boarding group, but you can only see and admit them one at a time. Periodically, as people enter, you will learn that one boarding group has ended and the next has begun.

    To advance to the next group, you're going to have to admit all the remaining passengers in the current group. You can't see what is downstream in line without letting all the current passengers through.

    Unix comparison: The design of groupby() is algorithmically similar to the Unix uniq utility.
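The uniq comparison can be made concrete with a small sketch. Like uniq, groupby only merges adjacent equal items, so a key that reappears later starts a new group. The string "aabba" is made-up sample data, not something from the question.

```python
from itertools import groupby

text = "aabba"  # made-up, deliberately unsorted sample data

# Like `uniq`: collapse each run of adjacent equal items into one key.
runs = [key for key, members in groupby(text)]

# Like `uniq -c`: count the members of each run while that run is current.
counted = [(key, len(list(members))) for key, members in groupby(text)]

# Sorting first brings equal items together, so each key shows up once.
sorted_runs = [key for key, members in groupby(sorted(text))]

print(runs)         # ['a', 'b', 'a']
print(counted)      # [('a', 2), ('b', 2), ('a', 1)]
print(sorted_runs)  # ['a', 'b']
```

This is why the usual recipe sorts the data by the same key function before grouping.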

    What the docs say: "The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible."
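A minimal sketch of what that sentence means in practice, using made-up sample data: once the groupby object has moved on to the second group, the iterator for the first group has nothing left to give.

```python
from itertools import groupby

groups = groupby("aabbb")  # made-up sample data

first_key, first_members = next(groups)
second_key, second_members = next(groups)  # advancing drops the 'a' group

stale = list(first_members)     # the 'a' members are no longer visible
current = list(second_members)  # the current group is still readable

print(first_key, stale)     # a []
print(second_key, current)  # b ['b', 'b', 'b']
```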

    How to use it: If the data is needed later, it should be stored as a list:

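    from itertools import groupby

    # data and keyfunc are whatever you want to group and the key to group by
    groups = []
    uniquekeys = []
    data = sorted(data, key=keyfunc)    # equal keys must be adjacent
    for k, g in groupby(data, keyfunc):
        groups.append(list(g))      # Store group iterator as a list
        uniquekeys.append(k)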
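For illustration, here is the same idea wrapped in a small function and run on concrete input. The helper name `group_into_lists` and the word list are hypothetical, chosen only to make the sketch runnable.

```python
from itertools import groupby

def group_into_lists(data, keyfunc):
    """Hypothetical helper: map each key to a list of its members."""
    result = {}
    for key, members in groupby(sorted(data, key=keyfunc), keyfunc):
        result[key] = list(members)  # copy the group before groupby moves on
    return result

words = ["pear", "fig", "apple", "kiwi", "plum"]  # made-up sample data
by_length = group_into_lists(words, len)

print(by_length)
# {3: ['fig'], 4: ['pear', 'kiwi', 'plum'], 5: ['apple']}
```

Returning a dict is just one possible design. It works here because the data is sorted first, so each key occurs in exactly one group.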
    
  • 2020-12-10 10:36

    groupby is super lazy. Here's an illuminating demo. Let's group three a-values and four b-values, and print out what's happening:

    >>> from itertools import groupby
    >>> def letters():
            for letter in 'a', 'a', 'a', 'b', 'b', 'b', 'b':
                print('yielding', letter)
                yield letter
    


    Going through the groups WITHOUT looking at their members

    Let's roll:

    >>> groups = groupby(letters())
    >>> 
    

    Nothing got printed yet! So until now, groupby did nothing. What a lazy bum. Let's ask it for the first group:

    >>> next(groups)
    yielding a
    ('a', <itertools._grouper object at 0x05A16050>)
    

    So groupby tells us that this is a group of a-values, and we could go through that _grouper object to get them all. But wait, why did "yielding a" get printed only once? Our generator is yielding three of them, isn't it? Well, that's because groupby is lazy. It did read one value to identify the group, because it needs to tell us what the group is about, i.e., that it's a group of a-values. And it offers us that _grouper object for us to get all the group's members if we want to. But we didn't ask to go through the members, so the lazy bum didn't go any further. It simply didn't have a reason to. Let's ask for the next group:

    >>> next(groups)
    yielding a
    yielding a
    yielding b
    ('b', <itertools._grouper object at 0x05A00FD0>)
    

    Wait, what? Why "yielding a" when we're now dealing with the second group, the group of b-values? Well, because groupby previously stopped after the first a because that was enough to give us all we had asked for. But now, to tell us about the second group, it has to find the second group, and for this it asks our generator until it sees something other than a. Note that "yielding b" is again only printed once, even though our generator yields four of them. Let's ask for the third group:

    >>> next(groups)
    yielding b
    yielding b
    yielding b
    Traceback (most recent call last):
      File "<pyshell#32>", line 1, in <module>
        next(groups)
    StopIteration
    

    OK, so there is no third group, and thus groupby raises StopIteration so that the consumer (e.g., a loop or list comprehension) knows to stop. But before that, the remaining "yielding b" lines get printed, because groupby got off its lazy butt and walked over the remaining values in the hope of finding a new group.
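The same walk-through can be written as a script that counts how many letters groupby has pulled from the generator at each step. This is only a sketch: the `log` list stands in for the print() call, and `next(groups, None)` is used so that the missing third group does not raise.

```python
from itertools import groupby

log = []  # every letter the source generator has handed out so far

def letters():
    for letter in "aaabbbb":
        log.append(letter)  # stands in for the print() call in the demo
        yield letter

groups = groupby(letters())
pulled_on_creation = len(log)       # nothing has been read yet

next(groups)                        # identify the 'a' group
pulled_for_first_group = len(log)   # just one 'a'

next(groups)                        # skip the unread a's, find the 'b' group
pulled_for_second_group = len(log)  # a, a, a, b

third = next(groups, None)          # no third group, so the default comes back
pulled_in_total = len(log)          # the remaining b's were read as well

print(pulled_on_creation, pulled_for_first_group,
      pulled_for_second_group, pulled_in_total)  # 0 1 4 7
```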


    Going through the groups WITH looking at their members

    Let's try again, this time let's ask for the members:

    >>> groups = groupby(letters())
    >>> key, members = next(groups)
    yielding a
    >>> key
    'a'
    

    Again, groupby asked our generator for just a single value, in order to identify the group so it can tell us that it's an a-group. But this time, we'll also ask for the group members:

    >>> list(members)
    yielding a
    yielding a
    yielding b
    ['a', 'a', 'a']
    

    Aha! There are the remaining "yielding a". Also, already the first "yielding b"! Even though we didn't even ask for the second group yet! But of course groupby has to go this far because we asked for the group members, so it has to keep looking until it gets a non-member. Let's get the next group:

    >>> key, members = next(groups)
    >>> 
    

    Wait, what? Nothing got printed at all? Is groupby sleeping? Wake up! Oh wait... that's right... it already found out that the next group is b-values. Let's ask for all of them:

    >>> list(members)
    yielding b
    yielding b
    yielding b
    ['b', 'b', 'b', 'b']
    

    Now the remaining three "yielding b" happen, because we asked for them so groupby has to get them.
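The read-ahead seen above (the first "yielding b" showing up while the a-group is being listed) can also be checked from the other side, by looking at what is still left in the source. A minimal sketch, assuming a plain string iterator instead of the generator. Draining `source` by hand is only done here for inspection, since it leaves nothing for groupby to continue with.

```python
from itertools import groupby

source = iter("aaabbbb")  # a plain iterator, so we can inspect what is left
groups = groupby(source)

key, members = next(groups)
a_members = list(members)  # reads all three a's, plus one b to find the end

# Whatever is still in `source` has not been touched by groupby yet.
untouched = list(source)

print(key, a_members)  # a ['a', 'a', 'a']
print(untouched)       # ['b', 'b', 'b']
```

Only three of the four b's remain, because groupby already holds the first one.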


    Why doesn't it work to get the group members afterwards?

    Let's try it your initial way with list(groupby(...)):

    >>> groups = list(groupby(letters()))
    yielding a
    yielding a
    yielding a
    yielding b
    yielding b
    yielding b
    yielding b
    >>> [list(members) for key, members in groups]
    [[], ['b']]
    

    Note that not only is the first group empty, but also, the second group only has one element (you didn't mention that).

    Why?

    Again: groupby is super lazy. It offers you those _grouper objects so you can go through each group's members. But if you don't ask to see the group members and instead just ask for the next group to be identified, then groupby just shrugs and is like "Ok, you're the boss, I'll just go find the next group".

    What your list(groupby(...)) does is it asks groupby to identify all groups. So it does that. But if you then at the end ask for the members of each group, then groupby is like "Dude... I'm sorry, I offered them to you but you didn't want them. And I'm lazy, so I don't keep things around for no good reason. I can give you the last member of the last group, because I still remember that one, but for everything before that... sorry, I just don't have them anymore, you should've told me that you wanted them".

    P.S. In all of this, of course "lazy" really means "efficient". Not something bad but something good!
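Here is the failing pattern once more as a plain script. What is left in the last group seems to depend on the interpreter: the `['b']` above comes from the Python used for this demo, while newer CPython releases (3.7 and later, as far as I know) hand back an empty last group as well. The sketch therefore only relies on the parts that should hold either way.

```python
from itertools import groupby

groups = list(groupby("aaabbbb"))  # runs groupby to the very end first

keys = [key for key, members in groups]
leftovers = [list(members) for key, members in groups]

print(keys)          # ['a', 'b'] : the keys survive
print(leftovers[0])  # [] : the a-members were skipped and are gone
```

So the keys can be collected this way, but the members cannot.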

  • 2020-12-10 10:43

    From the itertools.groupby() documentation:

    The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.

    Turning the output from groupby() into a list advances the groupby() object.


    Hence, you shouldn't convert the itertools.groupby object directly to a list. If you want to store the values, copy each group into a list as you go, for example with a list comprehension like this:

    import itertools
    grouped_l = [(a, list(b)) for a, b in itertools.groupby(l, key=lambda x: x[0])]
    

    This lets you iterate over the list (built from the groupby object) as many times as you like. However, if you only need to iterate over the result once, the second solution you mentioned in the question is sufficient.
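A short sketch of both options. The list `l` below is a hypothetical stand-in for the question's list of tuples, and the single-pass loop is just one way to use each group right away; it is not necessarily the solution from the question.

```python
import itertools

# Hypothetical stand-in for the question's list of tuples.
l = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e")]

grouped_l = [(a, list(b)) for a, b in itertools.groupby(l, key=lambda x: x[0])]

# The groups are plain lists now, so they can be read as often as needed.
first_pass = [(number, len(items)) for number, items in grouped_l]
second_pass = [(number, len(items)) for number, items in grouped_l]

# Single pass: use each group inside the loop, without storing it.
letters_per_number = {}
for number, items in itertools.groupby(l, key=lambda x: x[0]):
    letters_per_number[number] = "".join(letter for _, letter in items)

print(first_pass)          # [(1, 2), (2, 2), (3, 1)]
print(letters_per_number)  # {1: 'ab', 2: 'cd', 3: 'e'}
```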
