Say I have a sorted list of strings as in:
['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
Now I wan
If I understand correctly, your ultimate goal is to sort sub-sequences, while leaving alone the items that are not part of the sub-sequences.
In your example, the sub-sequence is defined as items starting with "B". Your example list happens to contain items in lexicographic order, which is a bit too convenient, and can be distracting from finding a generalized solution. Let's mix things up a little by using a different example list. How about:
['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2']
Here, the items are no longer ordered (at least I tried to arrange them that way), neither the ones starting with "B" nor the others. However, the items starting with "B" still form a single contiguous sub-sequence, occupying the single range 1-6 rather than split ranges such as 0-3 and 6-7. This might again be distracting; I will address that aspect further down.
If I understand your ultimate goal correctly, you would like this list to get sorted like this:
['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
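To make this work, we need a key function that returns a tuple, such that the first value keeps the items not starting with "B" in their original order, and the second value, present only for the items starting with "B", is the number used to sort them.
This can be implemented like this, with some doctests: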
def order_sublist(items):
    """
    >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    """
    def key():
        ord1 = [0]
        def inner(item):
            if not item.startswith('B'):
                ord1[0] += 1
                return ord1[0],
            return ord1[0], int(item[1:] or 0)
        return inner
    return sorted(items, key=key())
In this implementation, the items get sorted by these keys:
[(1,), (1, 2), (1, 11), (1, 22), (1, 0), (1, 1), (1, 21), (2,), (3,), (4,), (5,)]
The items not starting with "B" keep their order, thanks to the first value in the key tuple, and the items starting with "B" get sorted thanks to the second value of the key tuple.
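To see these keys for yourself, here is a small sketch that applies a fresh key function to each item in order. The name `make_key` is only an illustrative stand-in for the key factory in the implementation above.

```python
def make_key():
    # Same logic as the key factory above; the name make_key is hypothetical.
    ord1 = [0]

    def inner(item):
        if not item.startswith('B'):
            ord1[0] += 1
            return ord1[0],
        return ord1[0], int(item[1:] or 0)

    return inner


items = ['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2']
key = make_key()
# Call the key function on each item from left to right, as the trick assumes.
keys = [key(item) for item in items]
print(keys)
```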
This implementation contains a few tricks that are worth explaining:
The key function returns a tuple of 1 or 2 elements, as explained earlier: the non-B items have one value, the B items have two.
The first value of the tuple is not exactly the original index, but it's good enough. The value before the first B item is 1, all the B items use the same value, and the items after the B items get an incremented value every time. Since (1,) < (1, x) < (2,), where x can be anything, these keys will get sorted as we wanted them.
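The tuple comparison this relies on can be checked directly. This is just a quick sanity check with made-up keys, not part of the solution.

```python
# Tuples compare element by element; a shorter tuple that is a prefix
# of a longer one sorts first.
assert (1,) < (1, 0) < (1, 99) < (2,)

keys = [(2,), (1, 11), (1,), (1, 2), (3,)]
ordered = sorted(keys)
print(ordered)  # [(1,), (1, 2), (1, 11), (2,), (3,)]
```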
And now on to the "real" tricks :-)
What's up with the ord1 = [0] and ord1[0] += 1? This is a technique to change a non-local value from inside a nested function. Had I simply used ord1 = 0 and ord1 += 1, it would not work: ord1 would be an immutable integer defined outside of inner, and assigning to it inside inner makes the name local to inner, so ord1 += 1 would fail with an UnboundLocalError instead of updating the outer value. The global keyword would not help either, because ord1 is not a module-level name (in Python 3, the nonlocal keyword exists for exactly this case). But with ord1 being a list, the name is only read inside inner, never assigned, so the outer list is visible and its content can be modified. Note that the name itself still cannot be reassigned. If you replaced ord1[0] += 1 with ord1 = [ord1[0] + 1], which looks like it would produce the same value, it would not work: in that case ord1 on the left side is a local variable, shadowing the ord1 in the outer scope and not modifying its value.
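A minimal sketch of three variants of such a counter, assuming Python 3, where the nonlocal keyword offers an alternative to the list trick. The function names are made up for illustration.

```python
def counter_with_list():
    count = [0]

    def bump():
        count[0] += 1      # mutates the outer list; the name is never rebound
        return count[0]

    return bump


def counter_with_nonlocal():
    count = 0

    def bump():
        nonlocal count     # Python 3 only: rebind the enclosing variable
        count += 1
        return count

    return bump


def counter_broken():
    count = 0

    def bump():
        count += 1         # makes `count` local to bump, read before assignment
        return count

    return bump


a = counter_with_list()
b = counter_with_nonlocal()
print(a(), a(), b(), b())  # 1 2 1 2

try:
    counter_broken()()
except UnboundLocalError as exc:
    print('broken variant fails:', exc)
```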
What's up with the key and inner functions? I thought it would be neat if the key function we pass to sorted were reusable. This simpler version works too:
def order_sublist(items):
    ord1 = [0]
    def inner(item):
        if not item.startswith('B'):
            ord1[0] += 1
            return ord1[0],
        return ord1[0], int(item[1:] or 0)
    return sorted(items, key=inner)
The important difference is that if you wanted to use inner twice, both uses would share the same ord1 list. That can be acceptable, as long as the integer value ord1[0] doesn't overflow during use. In this case you won't use the function twice, and even if you did there would be no real risk of overflow, since Python integers grow as needed, but as a matter of principle it's nice to make the function clean and reusable by wrapping it as I did in my initial proposal. What the key function does is simply initialize ord1 = [0] in its scope, define the inner function, and return it. This way ord1 is effectively private, thanks to the closure. Every time you call key(), it returns a function that has its own private, fresh ord1 value.
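To illustrate the difference, this sketch calls the factory twice and shows that each returned function counts independently. It repeats the key factory from the first implementation as a standalone function.

```python
def key():
    ord1 = [0]

    def inner(item):
        if not item.startswith('B'):
            ord1[0] += 1
            return ord1[0],
        return ord1[0], int(item[1:] or 0)

    return inner


first = key()
second = key()

print(first('A'), first('C'))   # (1,) (2,)
print(second('A'))              # (1,)   fresh counter, unaffected by `first`
print(first('B7'))              # (2, 7) `first` continues from its own state
```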
Last but not least, notice the doctests: the """ ... """ docstring is more than just documentation, it's executable tests. The >>> lines are code to execute in a Python shell, and the lines that follow are the expected output. If you have this program in a file called script.py, you can run the tests with python -m doctest script.py. When all tests pass, you get no output. When a test fails, you get a nice report. It's a great way to verify that your program works, through demonstrated examples. You can have multiple test cases, separated by blank lines, to cover interesting corner cases. In this example there are two test cases, with your original sorted input, and the modified unsorted input.
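If you prefer to trigger the same checks from code rather than from the command line, the standard doctest module can also be driven programmatically. The sketch below is one way to do it; `run_doctests` is a hypothetical helper, and `order_sublist` is the simpler version from above with a shortened doctest added.

```python
import doctest


def order_sublist(items):
    """
    >>> order_sublist(['X', 'B2', 'B11', 'B', 'B1', 'C', 'Q1'])
    ['X', 'B', 'B1', 'B2', 'B11', 'C', 'Q1']
    """
    ord1 = [0]

    def inner(item):
        if not item.startswith('B'):
            ord1[0] += 1
            return ord1[0],
        return ord1[0], int(item[1:] or 0)

    return sorted(items, key=inner)


def run_doctests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    globs = {func.__name__: func}
    test = parser.get_doctest(func.__doc__, globs, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


print(run_doctests(order_sublist))  # (0, 1) when the example passes
```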
However, @zero-piraeus has made an interesting remark:
"I can see that your solution relies on sorted() scanning the list left-to-right (which is reasonable – I can't imagine TimSort is going to be replaced or radically changed any time soon – but not guaranteed by Python AFAIK, and there are sorting algorithms that don't work like that)."
I tried to be self-critical and to doubt that relying on left-to-right scanning is reasonable. But I think it is. After all, the sorting really happens based on the keys, not the actual values. I think most likely Python does something like this:
1. Build the list of keys, [key(value) for value in input], visiting the values from left to right.
2. zip the list of keys with the original items.
3. Sort the zipped pairs by key, then return [t[1] for t in zipped].
When building the list of key values, it could work on multiple threads, let's say two, the first thread populating the first half and the second thread populating the second half in parallel. That would mess up the ord1[0] += 1 trick. But I doubt it does this kind of optimization, as it simply seems overkill.
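One way to observe the order in which keys are computed is to record each call. The sketch below does that with a made-up list; note that it only shows what the interpreter you run it on does, not something the language guarantees. On CPython I would expect it to print True.

```python
calls = []


def recording_key(item):
    # Remember the order in which sorted() asks for keys.
    calls.append(item)
    return item


data = ['pear', 'apple', 'fig', 'kiwi', 'banana']
result = sorted(data, key=recording_key)

# True if the keys were requested exactly once each, left to right.
print(calls == data)
```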
But to eliminate any shadow of doubt, we can follow this alternative implementation strategy ourselves, though the solution becomes a bit more verbose:
def order_sublist(items):
    """
    >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    """
    ord1 = 0
    zipped = []
    for item in items:
        if not item.startswith('B'):
            ord1 += 1
        zipped.append((ord1, item))
    def key(item):
        if not item[1].startswith('B'):
            return item[0],
        return item[0], int(item[1][1:] or 0)
    return [v for _, v in sorted(zipped, key=key)]
Do note that thanks to the doctests, we have an easy way to verify that the alternative implementation still works as before.
What if you wanted this example list:
['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21']
To get sorted like this:
['X', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'Q1', 'C11', 'C2', 'B22']
That is, the items starting with "B" sorted by their numeric value, even when they don't form a contiguous sub-sequence?
That won't be possible with a magical key function. It certainly is possible though, with some more legwork. You could:
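For instance, as a minimal sketch of one possible approach: remember the positions of the items starting with "B", sort just those items by their numeric suffix, and write them back into the same positions. The function name `order_scattered` and its `prefix` parameter are made up for this example.

```python
def order_scattered(items, prefix='B'):
    """
    >>> order_scattered(['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'Q1', 'C11', 'C2', 'B22']
    """
    # Positions currently occupied by items with the prefix.
    positions = [i for i, item in enumerate(items) if item.startswith(prefix)]
    # Those items, sorted by the number after the prefix (0 if there is none).
    ordered = sorted(
        (items[i] for i in positions),
        key=lambda item: int(item[len(prefix):] or 0),
    )
    # Write the sorted items back into the same positions, leaving the rest alone.
    result = list(items)
    for position, item in zip(positions, ordered):
        result[position] = item
    return result
```

Unlike the key-function trick, this does not depend on the order in which sorted() computes keys, at the cost of a second pass over the list.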
If you need help with this last implementation, let me know.