Sort a sublist of elements in a list leaving the rest in place

小蘑菇 2021-02-13 14:28

Say I have a sorted list of strings as in:

['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']

Now I wan

15 answers
  • 2021-02-13 15:06

    If I understand correctly, your ultimate goal is to sort sub-sequences, while leaving alone the items that are not part of the sub-sequences.

    In your example, the sub-sequence is defined as items starting with "B". Your example list happens to contain items in lexicographic order, which is a bit too convenient, and can be distracting from finding a generalized solution. Let's mix things up a little by using a different example list. How about:

    ['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2']
    

    Here, the items are no longer ordered (at least I tried to arrange them so that they are not), neither the ones starting with "B" nor the others. However, the items starting with "B" still form a single contiguous sub-sequence, occupying the single range 1-6 rather than split ranges such as 0-3 and 6-7. This again might be distracting; I will address that aspect further down.

    If I understand your ultimate goal correctly, you would like this list to get sorted like this:

    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    

    To make this work, we need a key function that will return a tuple, such that:

    • First value:
      • If the item doesn't start with "B", then the index in the original list (or a value in the same order)
      • If the item starts with "B", then the index of the last item that didn't start with "B"
    • Second value:
      • If the item doesn't start with "B", then omit this
      • If the item starts with "B", then the numeric value

    This can be implemented like this, and with some doctests:

    def order_sublist(items):
        """
        >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
        ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    
        >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
        ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    
        """
        def key():
            ord1 = [0]
    
            def inner(item):
                if not item.startswith('B'):
                    ord1[0] += 1
                    return ord1[0],
                return ord1[0], int(item[1:] or 0)
            return inner
    
        return sorted(items, key=key())
    

    In this implementation, the items get sorted by these keys:

    [(1,), (1, 2), (1, 11), (1, 22), (1, 0), (1, 1), (1, 21), (2,), (3,), (4,), (5,)]
    

    The items not starting with "B" keep their order, thanks to the first value in the key tuple, and the items starting with "B" get sorted thanks to the second value of the key tuple.

    This implementation contains a few tricks that are worth explaining:

    • The key function returns a tuple of 1 or 2 elements, as explained earlier: the non-B items have one value, the B items have two.

    • The first value of the tuple is not exactly the original index, but it's good enough. The value before the first B item is 1, all the B items use the same value, and the values after the B get an incremented value every time. Since (1,) < (1, x) < (2,) where x can be anything, these keys will get sorted as we wanted them.
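    The tuple ordering this relies on can be checked directly. A minimal sketch; the sample keys are only illustrative and deliberately shuffled:

```python
# Tuples compare element by element; when one tuple is a prefix of the
# other, the shorter one sorts first.
assert (1,) < (1, 0) < (1, 22) < (2,)

# Keys shaped like the ones listed above, in scrambled order.
keys = [(2,), (1, 11), (1,), (3,), (1, 0), (1, 2)]
print(sorted(keys))
# [(1,), (1, 0), (1, 2), (1, 11), (2,), (3,)]
```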

    And now on to the "real" tricks :-)

    • What's up with ord1 = [0] and ord1[0] += 1? This is a technique for changing a non-local value from inside a nested function. Simply using ord1 = 0 and ord1 += 1 would not work: assigning to ord1 inside inner makes it a local variable of inner, which shadows the outer ord1, so ord1 += 1 fails with UnboundLocalError. The global keyword doesn't help either, because the outer ord1 lives in the enclosing function's scope, not at module level (Python 3 has nonlocal for this; Python 2 has no equivalent). Because ord1 is a list, though, it is visible inside inner and its content can be modified. Note that the name itself still cannot be reassigned: if you replaced ord1[0] += 1 with ord1 = [ord1[0] + 1], which looks equivalent, it would not work, because ord1 on the left side would again be a local variable shadowing the ord1 in the outer scope instead of modifying it.

    • What's up with the key and inner functions? I thought it would be neat if the key function we pass to sorted were reusable. This simpler version works too:

      def order_sublist(items):
          ord1 = [0]
      
          def inner(item):
              if not item.startswith('B'):
                  ord1[0] += 1
                  return ord1[0],
              return ord1[0], int(item[1:] or 0)
      
          return sorted(items, key=inner)
      

      The important difference is that if you wanted to use inner twice, both uses would share the same ord1 list. That can be acceptable: the counter simply keeps growing from where the previous use left off, only the relative order of its values matters, and Python integers don't overflow. In this case you won't use the function twice anyway, but as a matter of principle, it's nice to make the function clean and reusable by wrapping it as I did in my initial proposal. What the key function does is simply initialize ord1 = [0] in its scope, define the inner function, and return the inner function. This way ord1 is effectively private, thanks to the closure. Every time you call key(), it returns a function that has its own private, fresh ord1 value.

    • Last but not least, notice the doctests: the """ ... """ docstring is more than just documentation, it contains executable tests. The >>> lines are code to execute in a Python shell, and the following lines are the expected output. If you have this program in a file called script.py, you can run the tests with python -m doctest script.py. When all tests pass, you get no output. When a test fails, you get a nice report. It's a great way to verify that your program works, through demonstrated examples. You can have multiple test cases, separated by blank lines, to cover interesting corner cases. In this example there are two test cases, with your original sorted input, and the modified unsorted input.
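    On Python 3, the one-element list can be swapped for a nonlocal declaration. A minimal sketch, assuming Python 3 only; make_key and count are hypothetical names standing in for key() and ord1, and the doctest is run programmatically here rather than through python -m doctest:

```python
import doctest


def order_sublist(items):
    """
    >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    """
    def make_key():  # hypothetical name; plays the role of key() above
        count = 0  # a plain integer instead of ord1 = [0]

        def inner(item):
            nonlocal count  # rebind the variable in make_key's scope
            if not item.startswith('B'):
                count += 1
                return (count,)
            return (count, int(item[1:] or 0))
        return inner

    return sorted(items, key=make_key())


if __name__ == '__main__':
    # Runs the docstring examples of this module and prints a summary.
    print(doctest.testmod())
```

    Like the original, this still assumes the keys are computed from left to right.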

    However, @zero-piraeus has made an interesting remark:

    I can see that your solution relies on sorted() scanning the list left-to-right (which is reasonable – I can't imagine TimSort is going to be replaced or radically changed any time soon – but not guaranteed by Python AFAIK, and there are sorting algorithms that don't work like that).

    I tried to be self-critical and question whether relying on left-to-right scanning is reasonable. But I think it is. After all, the sorting really happens based on the keys, not the actual values. I think most likely Python does something like this:

    1. Take a list of the key values with [key(value) for value in input], visiting the values from left to right.
    2. zip the list of keys with the original items
    3. Apply whatever sorting algorithm on the zipped list, comparing items by the first value of the zip, and swapping items
    4. At the end, return the sorted items with return [t[1] for t in zipped]
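    The four steps above can be sketched in plain Python. This only illustrates the guess made here, not CPython's actual implementation; sorted_by_key is a hypothetical name, and the position index is an added assumption so that ties never fall back to comparing the items themselves:

```python
def sorted_by_key(items, key):
    # Steps 1 and 2: compute each key once, left to right, and pair it
    # with its item. The position i breaks ties and keeps the sort stable.
    decorated = [(key(item), i, item) for i, item in enumerate(items)]
    # Step 3: sort the decorated tuples (compared by key, then position).
    decorated.sort()
    # Step 4: strip the keys and positions again.
    return [item for _, _, item in decorated]


print(sorted_by_key(['pear', 'fig', 'apple', 'kiwi'], key=len))
# ['fig', 'pear', 'kiwi', 'apple']
```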

    When building the list of key values, it could work on multiple threads, let's say two, with the first thread populating the first half and the second thread populating the second half in parallel. That would mess up the ord1[0] += 1 trick. But I doubt it does this kind of optimization, as it seems like overkill.

    But to eliminate any shadow of doubt, we can follow this alternative implementation strategy ourselves, though the solution becomes a bit more verbose:

    def order_sublist(items):
        """
        >>> order_sublist(['A', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'C1', 'C11', 'C2'])
        ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    
        >>> order_sublist(['X', 'B2', 'B11', 'B22', 'B', 'B1', 'B21', 'C', 'Q1', 'C11', 'C2'])
        ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'Q1', 'C11', 'C2']
    
        """
        ord1 = 0
        zipped = []
        for item in items:
            if not item.startswith('B'):
                ord1 += 1
            zipped.append((ord1, item))
    
        def key(item):
            if not item[1].startswith('B'):
                return item[0],
            return item[0], int(item[1][1:] or 0)
    
        return [v for _, v in sorted(zipped, key=key)]
    

    Do note that thanks to the doctests, we have an easy way to verify that the alternative implementation still works as before.


    What if you wanted this example list:

    ['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21']
    

    To get sorted like this:

    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'Q1', 'C11', 'C2', 'B22']
    

    That is, the items starting with "B" sorted by their numeric value, even when they don't form a contiguous sub-sequence?

    That won't be possible with a magical key function. It certainly is possible though, with some more legwork. You could:

    1. Create a list with the original indexes of the items starting with "B"
    2. Create a list with the items starting with "B" and sort it with whatever way you like
    3. Write back the content of the sorted list at the original indexes

    If you need help with this last implementation, let me know.
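    For reference, those three steps could look roughly like this. A minimal sketch rather than a definitive implementation; sort_matching and its predicate and key parameters are hypothetical names introduced for illustration:

```python
def sort_matching(items, predicate, key):
    """Sort only the items matching predicate; leave the others in place."""
    # Step 1: remember where the matching items live.
    indexes = [i for i, item in enumerate(items) if predicate(item)]
    # Step 2: sort just those items.
    selected = sorted((items[i] for i in indexes), key=key)
    # Step 3: write them back into the original slots.
    result = list(items)
    for i, item in zip(indexes, selected):
        result[i] = item
    return result


items = ['X', 'B', 'B1', 'B11', 'B2', 'B22', 'C', 'Q1', 'C11', 'C2', 'B21']
print(sort_matching(
    items,
    predicate=lambda s: s.startswith('B'),
    key=lambda s: int(s[1:] or 0),
))
# ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'Q1', 'C11', 'C2', 'B22']
```

    Since the key is computed only from each item, this version does not depend on the order in which sorted() visits the list.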

  • 2021-02-13 15:07

    In the simple case where you just want to sort trailing digits numerically and their non-digit prefixes alphabetically, you need a key function which splits each item into non-digit and digit components as follows:

    'AB123' -> ['AB', 123]
    'CD'    -> ['CD']
    '456'   -> ['', 456]
    

    Note: In the last case, the empty string '' is not strictly necessary in CPython 2.x, as integers sort before strings – but that's an implementation detail rather than a guarantee of the language, and in Python 3.x it is necessary, because strings and integers can't be compared at all.

    You can build such a key function using a list comprehension and re.split():

    import re
    
    def trailing_digits(x):
        return [
            int(g) if g.isdigit() else g
            for g in re.split(r'(\d+)$', x)
        ]
    

    Here it is in action:

    >>> s1 = ['11', '2', 'A', 'B', 'B1', 'B11', 'B2', 'B21', 'C', 'C11', 'C2']
    

    >>> sorted(s1, key=trailing_digits)
    ['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
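    To see what the sort actually compares, it can help to print the keys. A small sketch that repeats trailing_digits from above so it runs on its own. Note that re.split() also leaves a trailing empty string when the digits sit at the end of the input; this does not affect the orderings shown here:

```python
import re


def trailing_digits(x):
    return [
        int(g) if g.isdigit() else g
        for g in re.split(r'(\d+)$', x)
    ]


for s in ['AB123', 'CD', '456']:
    print(repr(s), '->', trailing_digits(s))
# 'AB123' -> ['AB', 123, '']
# 'CD' -> ['CD']
# '456' -> ['', 456, '']
```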
    

    Once you add the restriction that only strings with a particular prefix or prefixes have their trailing digits sorted numerically, things get a little more complicated.

    The following function builds and returns a key function which fulfils the requirement:

    def prefixed_digits(*prefixes):
        disjunction = '|'.join('^' + re.escape(p) for p in prefixes)
        pattern = re.compile(r'(?<=%s)(\d+)$' % disjunction)
        def key(x):
            return [
                int(g) if g.isdigit() else g
                for g in re.split(pattern, x)
            ]
        return key
    

    The main difference here is that a precompiled regex is created (containing a lookbehind constructed from the supplied prefix or prefixes), and a key function using that regex is returned.
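    To make the lookbehind concrete, here is the pattern that gets built for the prefixes 'B' and 'C', and what splitting with it yields. A short sketch that repeats the two pattern-building lines from above; the sample strings are only illustrative:

```python
import re

prefixes = ('B', 'C')
disjunction = '|'.join('^' + re.escape(p) for p in prefixes)
pattern = re.compile(r'(?<=%s)(\d+)$' % disjunction)

print(pattern.pattern)        # (?<=^B|^C)(\d+)$
print(pattern.split('B11'))   # ['B', '11', '']  (prefix matched, digits split off)
print(pattern.split('D12'))   # ['D12']          (prefix not listed, left whole)
print(pattern.split('AB11'))  # ['AB11']         ('B' is not at the start)
```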

    Here are some usage examples:

    >>> s2 = ['A', 'B', 'B11', 'B2', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
    

    >>> sorted(s2, key=prefixed_digits('B'))
    ['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
    

    >>> sorted(s2, key=prefixed_digits('B', 'C'))
    ['A', 'B', 'B2', 'B11', 'B21', 'C', 'C2', 'C11', 'D12', 'D2']
    

    >>> sorted(s2, key=prefixed_digits('B', 'D'))
    ['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D2', 'D12']
    

    If called with no arguments, prefixed_digits() returns a key function which behaves identically to trailing_digits:

    >>> sorted(s1, key=prefixed_digits())
    ['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
    

    Caveats:

    1. Due to a restriction in Python's re module regarding lookbehind syntax, multiple prefixes must have the same length.

    2. In Python 2.x, strings which are purely numeric will be sorted numerically regardless of which prefixes are supplied to prefixed_digits(). In Python 3, they'll cause an exception (except when called with no arguments, or in the special case of key=prefixed_digits('') – which will sort purely numeric strings numerically, and prefixed strings alphabetically). Fixing that may be possible with a significantly more complex regex, but I gave up trying after about twenty minutes.
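    Both caveats can be reproduced in a few lines. A minimal sketch, assuming Python 3; compiles and sortable are hypothetical helper names, and the key lists are written out by hand in the shape the key function produces:

```python
import re


def compiles(regex):
    """Return True if re accepts the pattern, False on re.error."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False


def sortable(keys):
    """Return True if the keys can be sorted, False on TypeError."""
    try:
        sorted(keys)
        return True
    except TypeError:
        return False


# Caveat 1: lookbehind alternatives must all have the same width.
print(compiles(r'(?<=^B|^C)(\d+)$'))   # True: both alternatives have width 1
print(compiles(r'(?<=^B|^CD)(\d+)$'))  # False: widths 1 and 2 differ

# Caveat 2: a purely numeric string such as '2' produces the key [2],
# which Python 3 cannot compare with a key that starts with a string.
print(sortable([['B', 1, ''], ['C']]))  # True
print(sortable([[2], ['B', 1, '']]))    # False on Python 3
```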

  • 2021-02-13 15:08

    Using just key and the precondition that the sequence is already 'sorted':

    import re
    
    s = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    
    def subgroup_ordinate(element):
        # Split the sequence element values into groups and ordinal values.
        # use a simple regex and int() in this case
        m = re.search('(B)(.+)', element)  
        if m:
            subgroup = m.group(1)
            ordinate = int(m.group(2))
        else:
            subgroup = element
            ordinate = None
        return (subgroup, ordinate)
    
    print sorted(s, key=subgroup_ordinate)  # Python 2 print statement
    
    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
    

    The subgroup_ordinate() function does two things: it identifies groups to be sorted and determines the ordinal number within the groups. This example uses a regular expression, but the function could be arbitrarily complex. For example, we can change the pattern to ur'(B|C)(.+)' and sort both the B and C sequences.

    ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
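    The code above is Python 2 (print statement, ur'' literals, and None sorting before integers). A rough Python 3 adaptation follows, under the assumption that a bare group name such as 'B' should sort before its numbered members. The make_subgroup_key wrapper, the -1 sentinel, and the stricter anchored pattern are not from the original:

```python
import re


def make_subgroup_key(groups='B'):
    # groups is a regex alternation such as 'B' or 'B|C' (assumed format).
    pattern = re.compile(r'(%s)(\d+)$' % groups)

    def subgroup_ordinate(element):
        m = pattern.match(element)
        if m:
            return (m.group(1), int(m.group(2)))
        # Python 3 cannot compare None with int, so -1 means "no ordinal".
        return (element, -1)
    return subgroup_ordinate


s = ['A', 'B', 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
print(sorted(s, key=make_subgroup_key('B')))
# ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
print(sorted(s, key=make_subgroup_key('B|C')))
# ['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
```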
    

    Reading the bounty question carefully I note the requirement 'sorts some values while leaving others "in place"'. Defining the comparison function to return 0 for elements that are not in subgroups would leave these elements where they were in the sequence.

    s2 = ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'A', 'C', 'C1', 'C2', 'C11']
    
    # Python 2 only: tuple parameters and cmp() were removed in Python 3
    def compare((_a,a),(_b,b)):
        return 0 if a is None or b is None else cmp(a,b)
    
    print sorted(s2, compare, subgroup_ordinate)
    
    ['X', 'B', 'B1', 'B2', 'B11', 'B21', 'A', 'C', 'C1', 'C2', 'C11']
    