I have a list named values containing a series of numbers:
values = [0, 1, 2, 3, 4, 5, ... , 351, 0, 1, 2, 3, 4, 5, 6, ... , 750, 0, 1, 2, 3, 4, 5, ... , 559]
So your problem with the code as written is that it includes an empty list at the beginning, and omits the final sub-list. The minimalist fix for this is:
1. Change the test to avoid appending the first list (when i is 0), e.g. if val == 0 and i != 0:
2. Append the final group after the loop exits
Combining the two fixes, you'd have:
start = 0
new_values = []
for i, val in enumerate(values):
    if val == 0 and i != 0:  # Avoid adding empty list
        new_values.append(values[start:i])
        start = i
if values:  # Handle edge case for empty values where nothing to add
    new_values.append(values[start:])  # Add final list
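To see the combined fix in action, here is a small sketch that wraps the loop in a function. The name split_on_zeros and the shortened sample input are mine, not from the question:

```python
def split_on_zeros(values):
    # Hypothetical wrapper around the fixed loop; expects a list, since it slices
    start = 0
    new_values = []
    for i, val in enumerate(values):
        if val == 0 and i != 0:  # A new run starts here; close out the previous one
            new_values.append(values[start:i])
            start = i
    if values:  # Nothing to add for empty input
        new_values.append(values[start:])
    return new_values


sample = [0, 1, 2, 3, 0, 1, 2, 0, 1]
print(split_on_zeros(sample))  # [[0, 1, 2, 3], [0, 1, 2], [0, 1]]
print(split_on_zeros([]))      # []
```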
I was going to add the cleaner groupby solution which avoids the special cases for beginning/end of list, but Chris_Rands already handled that, so I'll refer you to his answer.
Somewhat surprisingly, this actually seems to be the fastest solution in the timings below, at the expense of requiring the input to be a list (where some of the other solutions can accept arbitrary iterables, including pure iterators for which indexing is impossible).
For comparison (using Python 3.5 additional unpacking generalizations both for brevity and to get optimal performance on modern Python, and using the implicit booleanness of int to avoid comparing to 0, since it's equivalent for int input but meaningfully faster):
from itertools import groupby, zip_longest

# truth is the same as bool, but unlike the bool constructor, it requires
# exactly one positional argument, which makes a *major* difference
# on runtime when it's in a hot code path
from operator import truth

def method1(values):
    # Optimized/correct OP's code
    # Only works on list inputs, and requires non-empty values to begin with 0,
    # but handles repeated 0s as separate groups properly
    new_values = []
    start = None
    for i, val in enumerate(values):
        if not val and i:
            new_values.append(values[start:i])
            start = i
    if values:
        new_values.append(values[start:])
    return new_values

def method2(values):
    # Works with arbitrary iterables and iterators, but doesn't handle
    # repeated 0s or non-empty values that don't begin with 0
    return [[0, *g] for k, g in groupby(values, truth) if k]

def method3(values):
    # Same behaviors and limitations as method1, but without verbose
    # special casing for begin and end
    start_indices = [i for i, val in enumerate(values) if not val]
    # End indices for all but terminal slice are previous start index
    # so make iterator and discard first value to pair properly
    end_indices = iter(start_indices)
    next(end_indices, None)
    # Pairing with zip_longest avoids need to explicitly pad end_indices
    return [values[s:e] for s, e in zip_longest(start_indices, end_indices)]

def method4(values):
    # Requires any non-empty values to begin with 0
    # but otherwise handles runs of 0s and arbitrary iterables (including iterators)
    new_values = []
    for val in values:
        if not val:
            curlist = [val]
            new_values.append(curlist)
            # Use pre-bound method in local name for speed
            curlist_append = curlist.append
        else:
            curlist_append(val)
    return new_values

def method5(values):
    # Most flexible solution; similar to method2, but handles all inputs, empty, non-empty,
    # with or without leading 0, with or without runs of repeated 0s
    new_values = []
    for nonzero, grp in groupby(values, truth):
        if nonzero:
            try:
                new_values[-1] += grp
            except IndexError:
                new_values.append([*grp])  # Only happens when values begins with nonzero
        else:
            new_values += [[0] for _ in grp]
    return new_values
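The edge-case notes in the comments are easiest to see on small inputs. The sketch below repeats method2 and method5 so it runs on its own; the sample inputs are made up for illustration:

```python
from itertools import groupby
from operator import truth


def method2(values):
    return [[0, *g] for k, g in groupby(values, truth) if k]


def method5(values):
    new_values = []
    for nonzero, grp in groupby(values, truth):
        if nonzero:
            try:
                new_values[-1] += grp
            except IndexError:
                new_values.append([*grp])
        else:
            new_values += [[0] for _ in grp]
    return new_values


# Well-formed input: both agree
print(method2([0, 1, 2, 0, 3]))  # [[0, 1, 2], [0, 3]]
print(method5([0, 1, 2, 0, 3]))  # [[0, 1, 2], [0, 3]]

# Repeated 0s: method2 collapses the run, method5 keeps each 0 as its own group
print(method2([0, 0, 1, 2]))  # [[0, 1, 2]]
print(method5([0, 0, 1, 2]))  # [[0], [0, 1, 2]]

# No leading 0: method2 invents one, method5 keeps the leading run as-is
print(method2([1, 2, 0, 3]))  # [[0, 1, 2], [0, 3]]
print(method5([1, 2, 0, 3]))  # [[1, 2], [0, 3]]
```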
Timings on Python 3.6, Linux x64, using ipython 6.1's %timeit magic:
>>> values = [*range(100), *range(50), *range(150)]
>>> %timeit -r5 method1(values)
12.5 μs ± 50.6 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)
>>> %timeit -r5 method2(values)
16.9 μs ± 54.9 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)
>>> %timeit -r5 method3(values)
13 μs ± 18.9 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)
>>> %timeit -r5 method4(values)
16.7 μs ± 9.51 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)
>>> %timeit -r5 method5(values)
18.2 μs ± 25.2 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)
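If you want to reproduce these numbers without IPython, a rough equivalent with the standard timeit module might look like the following. method1 is repeated so the snippet runs on its own; the loop count is an arbitrary choice of mine, it reports the best run rather than %timeit's mean, and absolute numbers will differ by machine and Python version:

```python
import timeit


def method1(values):
    new_values = []
    start = None
    for i, val in enumerate(values):
        if not val and i:
            new_values.append(values[start:i])
            start = i
    if values:
        new_values.append(values[start:])
    return new_values


values = [*range(100), *range(50), *range(150)]

loops = 2000  # Arbitrary; raise it for steadier numbers
# Five runs of `loops` calls each, loosely mirroring %timeit -r5
runs = timeit.repeat(lambda: method1(values), repeat=5, number=loops)
# repeat() returns total seconds per run; convert the best run to microseconds per call
best_us = min(runs) / loops * 1e6
print(f"method1: {best_us:.1f} us per loop (best of 5)")
```

Swapping in any of the other methods only requires changing the function passed to the lambda.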
Summary:
Solutions that slice out the runs in bulk (method1, method3) are the fastest, but depend on the input being a sequence (and if the return type must be list, the input must be list too, or conversions must be added).
groupby solutions (method2, method5) are a little slower, but are typically quite succinct (handling all edge cases as in method5 requires neither extreme verbosity nor explicit test-and-check LBYL patterns). They also don't require a lot of hackery to make them go as fast as possible, aside from using operator.truth instead of bool. This is necessary because CPython's bool constructor is very slow thanks to some weird implementation details (bool must accept full varargs, including keywords, dispatching through the object construction machinery, which costs a lot more than operator.truth, which uses a low overhead path that takes exactly one positional argument and bypasses object construction machinery); if bool were used as the key function instead of operator.truth, runtimes more than double (to 36.8 μs and 38.8 μs for method2 and method5 respectively).
In between is the slower, but more flexible approach (handles arbitrary input iterables, including iterators, handles runs of 0s with no special casing, etc.) using item-by-item appends (method4). Problem is, getting maximum performance requires much more verbose code (because of the need to avoid repeated indexing and method binding); if the loop of method4 is changed to the much more succinct:
for val in values:
    if not val:
        new_values.append([])
    new_values[-1].append(val)
the runtime more than doubles (to ~34.4 μs), thanks to the cost of repeatedly indexing new_values and binding the append method over and over.
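To make that trade-off concrete, here is a sketch that places the two method4 variants side by side and checks that they agree. The names method4_fast and method4_short are mine, and the timing gap is printed rather than asserted, since it varies by interpreter version:

```python
import timeit


def method4_fast(values):
    # Pre-binds curlist.append to a local name to skip repeated lookups
    new_values = []
    for val in values:
        if not val:
            curlist = [val]
            new_values.append(curlist)
            curlist_append = curlist.append
        else:
            curlist_append(val)
    return new_values


def method4_short(values):
    # Succinct form: indexes new_values[-1] and rebinds .append on every item
    new_values = []
    for val in values:
        if not val:
            new_values.append([])
        new_values[-1].append(val)
    return new_values


values = [*range(100), *range(50), *range(150)]
assert method4_fast(values) == method4_short(values)

for func in (method4_fast, method4_short):
    secs = min(timeit.repeat(lambda: func(values), repeat=5, number=2000))
    print(f"{func.__name__}: {secs / 2000 * 1e6:.1f} us per loop")
```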
In any event, personally, if performance wasn't absolutely critical, I'd use one of the groupby solutions using bool as the key just to avoid imports and uncommon APIs. If performance was more important, I'd probably still use groupby, but swap in operator.truth as the key function; sure, it's not as fast as the spelled out version, but for people who know groupby, it's easy enough to follow, and it's generally the most succinct solution for any given level of edge case handling.
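As a sketch of that recommendation, the key function can be made a parameter so that bool is the default and operator.truth is a drop-in swap. split_runs and its key parameter are hypothetical names; the body is the method2 one-liner, so it carries the same limitations (input must begin with 0 and contain no repeated 0s):

```python
from itertools import groupby
from operator import truth


def split_runs(values, key=bool):
    # Hypothetical helper: method2's logic with the groupby key made swappable.
    # Each truthy run is one group's tail; the 0 that preceded it is re-added.
    return [[0, *g] for k, g in groupby(values, key) if k]


values = [*range(4), *range(3)]
print(split_runs(values))             # [[0, 1, 2, 3], [0, 1, 2]]
print(split_runs(values, key=truth))  # [[0, 1, 2, 3], [0, 1, 2]]
```

Both keys give the same grouping for int input, so the choice only affects speed and which imports are needed.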