I am looking to transform a list into smaller lists of equal values. An example I have is:
[\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"c\", \"c\", \"c\"]
You could use collections.Counter:
>>> lst = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
>>> import collections
>>> collections.Counter(lst).most_common()
[('c', 4), ('a', 3), ('b', 2)]
This works even when the values are not ordered, and it provides a very compact representation which you can then expand into lists if needed:
>>> [[i]*n for i,n in collections.Counter(lst).most_common()]
[['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]
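To illustrate the point about ordering (a quick sketch; the shuffled list here is just an example, not from the question):
>>> from collections import Counter
>>> messy = ["b", "a", "c", "a", "c", "b", "a", "c", "c"]  # same values, shuffled
>>> [[value] * count for value, count in Counter(messy).most_common()]
[['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]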
You could use itertools.groupby to solve the problem:
>>> from itertools import groupby
>>> [list(grp) for k, grp in groupby(["a", "a", "a", "b", "b", "c", "c", "c", "c"])]
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
It only groups consecutive equal elements, but that seems to be enough in your case.
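If your input might not be sorted, the usual fix (a sketch of my own, not part of the original answer) is to sort first so that equal elements become consecutive:
>>> from itertools import groupby
>>> unsorted_data = ["b", "a", "c", "a", "c", "b", "a", "c", "c"]
>>> [list(grp) for k, grp in groupby(sorted(unsorted_data))]
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]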
Another way to get your desired output is to use defaultdict from the collections module (the best time using this approach was ~0.02s, the same as with groupby):
from collections import defaultdict
a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
b = defaultdict(list)
for k in a:
    b[k].append(k)
>>> b
defaultdict(list,
{'a': ['a', 'a', 'a'], 'b': ['b', 'b'], 'c': ['c', 'c', 'c', 'c']})
So, all you have to do now is:
>>> list(b.values())
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
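A closely related variant (my addition, not from the original answer) uses plain dict.setdefault, which gives the same grouping without the defaultdict import:
a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
b = {}
for k in a:
    b.setdefault(k, []).append(k)  # create the group list on first sight of k
>>> list(b.values())
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]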
While I'd personally go with itertools.groupby as the most convenient way, you've asked for efficiency, and this should be considerably faster than any of the itertools options:
data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
lookup = {}  # maps each value to its group list inside result
result = []
for element in data:
    if element not in lookup:
        # first occurrence: create the group and register it in both places
        target = lookup[element] = [element]
        result.append(target)
    else:
        lookup[element].append(element)
print(result)
# [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
If the data is always ordered (i.e. equal elements don't mix), this can be further optimized by dropping the lookup table and using a list comprehension for maximum performance; a sketch of that idea follows.
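The answer doesn't spell that version out; here is one possible reading of the suggestion, assuming the input is sorted so each distinct value forms exactly one contiguous run (dict.fromkeys keeps first-appearance order, and count() rescans the list, so treat this as illustrative rather than optimal):
data = ["a"] * 3 + ["b"] * 2 + ["c"] * 4
# rebuild each group from the distinct values and their counts
grouped = [[value] * data.count(value) for value in dict.fromkeys(data)]
print(grouped)
# [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]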
UPDATE - Some clarification on efficiency and behavior. If you set up your test as:
from itertools import groupby
def itools_func(data):
    return [list(grp) for k, grp in groupby(data)]

def manual_func(data):
    lookup = {}
    result = []
    for element in data:
        if element not in lookup:
            target = lookup[element] = [element]
            result.append(target)
        else:
            lookup[element].append(element)
    return result
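(For context: the per-loop figures below read like timeit output; a harness along these lines could reproduce them, though the original answer doesn't show one.)
import timeit

# sketch only - assumes itools_func, manual_func and test_data (see below) are defined
for func in (itools_func, manual_func):
    total = timeit.timeit(lambda: func(test_data), number=100)
    print(f"{func.__name__} - 100 loops: {total:.3f}s, per loop: {total / 100 * 1000:.2f}ms")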
The problem is that those two will not return the same values:
test_data = ["a", "a", "b", "c", "c", "b", "a"]
itools_func(test_data) # [['a', 'a'], ['b'], ['c', 'c'], ['b'], ['a']]
manual_func(test_data) # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c']]
From the OP's question, I understood that they want the latter (based on the comment "I sorted the list to make values consecutive"), because with a sorted list this can be done far more easily. So, if we feed those functions a really long list:
test_data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] * 10000 # 10000 x the original
On my system it clocks as follows:
itools_func - 100 loops: 2.668s, per loop: 26.68ms
manual_func - 100 loops: 1.005s, per loop: 10.05ms
But this is an unfavorable setting for itertools.groupby. If the data were sorted like:
test_data = ["a"] * 3000 + ["b"] * 2000 + ["c"] * 40000
The story is quite a bit different as the C backend kicks in:
itools_func - 1000 loops: 656.3ms, per loop: 656.3µs
manual_func - 1000 loops: 4.816s, per loop: 4.816ms
When the data is sorted, the manual function can be further optimized, but it will hardly beat what itertools does under the hood.