Splitting list into smaller lists of equal values

前端 未结 4 705
广开言路
广开言路 2021-01-15 18:22

I am looking to transform a list into smaller lists of equal values. An example I have is:

[\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"c\", \"c\", \"c\"] 
<         


        
相关标签:
4条回答
  • 2021-01-15 18:48

    You could use collections.Counter

    >>> lst = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
    >>> import collections
    >>> collections.Counter(lst).most_common()
    [('c', 4), ('a', 3), ('b', 2)]
    

    This works even when the values are not ordered and provides a very compact representation which then you could expand if needed into lists:

    >>> [[i]*n for i,n in collections.Counter(lst).most_common()]
    [['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]
    
    0 讨论(0)
  • 2021-01-15 18:51

    You could use itertools.groupby to solve the problem:

    >>> from itertools import groupby
    >>> [list(grp) for k, grp in groupby(["a", "a", "a", "b", "b", "c", "c", "c", "c"])]
    [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
    

    It only groups consecutive equal elements but that seems enough in your case.

    0 讨论(0)
  • 2021-01-15 19:03

    Another manner to have your desired output is by using defaultdict from collections module (best time using this approach was: ~= 0.02s same as using groupby):

    from collections import defaultdict
    a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
    b = defaultdict(list)
    for k in a:
        b[k].append(k)
    
    >>> b 
    defaultdict(list,
                {'a': ['a', 'a', 'a'], 'b': ['b', 'b'], 'c': ['c', 'c', 'c', 'c']})
    

    So, what you have to do now is:

    list(b.values())
    >>> [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
    
    0 讨论(0)
  • 2021-01-15 19:05

    While I'd personally go with itertools.groupby as the most convinient way, you've asked for efficiency and this should be considerably faster than any of the itertools options:

    data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] 
    
    lookup = {}  # lookup map
    result = []
    for element in data:
        if element not in lookup:
            target = lookup[element] = [element]
            result.append(target)
        else:
            lookup[element].append(element)
    
    print(result)
    # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
    

    If the data is always ordered (i.e. elements don't mix) this can be further optimized without a lookup table and using a list comprehension for maximum performance.

    UPDATE - Some clarification on efficiency and operation. If you set up your test as:

    from itertools import groupby
    
    def itools_func(data):
        return [list(grp) for k, grp in groupby(data)]
    
    def manual_func(data):
        lookup = {}
        result = []
        for element in data:
            if element not in lookup:
                target = lookup[element] = [element]
                result.append(target)
            else:
                lookup[element].append(element)
        return result
    

    The problem is that those two will not return the same values:

    test_data = ["a", "a", "b", "c", "c", "b", "a"]
    
    itools_func(test_data)  # [['a', 'a'], ['b'], ['c', 'c'], ['b'], ['a']]
    manual_func(test_data)  # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c']]
    

    From OP's question, I understood he wants the latter one (based on his comment "I sorted the list to make values consecutive") because with a sorted list this can be done far easier. So, if we feed those functions a really long list:

    test_data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] * 10000  # 10000 x the original
    

    On my system it clocks as follows:

    itools_func - 100 loops: 2.668s, per loop: 26.68ms
    manual_func - 100 loops: 1.005s, per loop: 10.05ms
    

    But, this is an unfavorable setting for itertools.groopby. If the data was to be sorted like:

    test_data = ["a"] * 3000 + ["b"] * 2000 + ["c"] * 40000
    

    The story is quite a bit different as the C backend kicks in:

    itools_func - 1000 loops: 656.3ms, per loop: 656.3µs
    manual_func - 1000 loops: 4.816s, per loop: 4.816ms
    

    When the data is sorted the manual function can be further optimized but it will hardly beat what itertools does under the hood.

    0 讨论(0)
提交回复
热议问题