Are there any guarantees about the splitting order of str.split()?

逝去的感伤 2021-01-06 23:13

According to the Python 2.7 docs, using str.split() with maxsplit specified will split a string up to maxsplit times.

However

2 answers
  • 2021-01-07 00:05

    If you're looking for a guarantee that splitting with the maxsplit argument proceeds from left to right, you only need to look at Python's built-in test suite.

    Here's an excerpt:

        self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|')
        self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
        self.checkequal(['a', 'b|c|d'], 'a|b|c|d', 'split', '|', 1)
        self.checkequal(['a', 'b', 'c|d'], 'a|b|c|d', 'split', '|', 2)
        self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 3)
        self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|', 4)
        self.checkequal(['a', 'b', 'c', 'd'], 'a|b|c|d', 'split', '|',
                        sys.maxsize-2)
        self.checkequal(['a|b|c|d'], 'a|b|c|d', 'split', '|', 0)
        self.checkequal(['a', '', 'b||c||d'], 'a||b||c||d', 'split', '|', 2)
        self.checkequal(['abcd'], 'abcd', 'split', '|')
        self.checkequal([''], '', 'split', '|')
        self.checkequal(['endcase ', ''], 'endcase |', 'split', '|')
        self.checkequal(['', ' startcase'], '| startcase', 'split', '|')
        self.checkequal(['', 'bothcase', ''], '|bothcase|', 'split', '|')
        self.checkequal(['a', '', 'b\x00c\x00d'], 'a\x00\x00b\x00c\x00d', 'split', '\x00', 2)
    

    Any implementation that split in a different order would fail these tests.
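
The same behaviour can be checked directly in an interpreter. This is a minimal sketch that uses only str.split and str.rsplit. The sample string comes from the tests above; the rsplit call is added here only for contrast and is not part of the quoted excerpt.

```python
s = 'a|b|c|d'

# split counts its splits from the left, so the unsplit remainder is the last item.
left_1 = s.split('|', 1)
left_2 = s.split('|', 2)

# rsplit counts its splits from the right, so the unsplit remainder is the first item.
right_1 = s.rsplit('|', 1)

print(left_1)   # ['a', 'b|c|d']
print(left_2)   # ['a', 'b', 'c|d']
print(right_1)  # ['a|b|c', 'd']
```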

  • 2021-01-07 00:16

    CPython is considered the reference implementation of Python, and its source code shows that str.split splits in left-to-right order. You can see how str.split is implemented here: http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup

    For example, stringlib_split_char and stringlib_split_whitespace, both of which are used by stringlib_split (str.split), clearly process the string from left to right. The indexes i and j both start at zero and are only ever incremented. maxsplit (maxcount in the code) does not affect how the indexes are treated; it only provides an early exit from the loop:

    Py_LOCAL_INLINE(PyObject *)
    stringlib_split_char(PyObject* str_obj,
                         const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                         const STRINGLIB_CHAR ch,
                         Py_ssize_t maxcount)
    {
        // ... some code omitted
    
        i = j = 0;
        while ((j < str_len) && (maxcount-- > 0)) {
            for(; j < str_len; j++) {
                /* I found that using memchr makes no difference */
                if (str[j] == ch) {
                    SPLIT_ADD(str, i, j);
                    i = j = j + 1;
                    break;
                }
            }
        }
        // ... some code omitted
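
To make the index movement easier to follow, the loop above can be transcribed into Python. This is only a sketch: split_char is a hypothetical helper, not a CPython API. The handling of a negative maxcount and the final append of the remainder are assumptions about the omitted parts of the C function.

```python
def split_char(s, ch, maxcount=-1):
    """Left-to-right scan modelled on the stringlib_split_char excerpt."""
    if maxcount < 0:
        # Assumption: a negative limit means "no limit", as with str.split.
        maxcount = len(s)
    parts = []
    i = j = 0  # i marks the start of the current piece, j is the scan position
    while j < len(s) and maxcount > 0:
        maxcount -= 1
        while j < len(s):
            if s[j] == ch:
                parts.append(s[i:j])  # plays the role of SPLIT_ADD(str, i, j)
                i = j = j + 1
                break
            j += 1
    # Assumption: whatever is left after the last split becomes the final item.
    parts.append(s[i:])
    return parts


print(split_char('a|b|c|d', '|', 2))  # ['a', 'b', 'c|d']
```

Both indexes only ever increase, so a limit on the number of splits can only cut the scan short; it cannot change which separators are consumed first.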
    

    In stringlib_rsplit_char (used by str.rsplit), both indexes i and j start at the end of the string and are decremented:

    i = j = str_len - 1;
    while ((i >= 0) && (maxcount-- > 0)) {
        for(; i >= 0; i--) {
            if (str[i] == ch) {
                SPLIT_ADD(str, i + 1, j + 1);
                j = i = i - 1;
                break;
            }
        }
    }
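
The right-to-left scan can be transcribed into Python in the same way. Again this is a sketch under stated assumptions: rsplit_char is a hypothetical helper, and the final append and the list reversal stand in for code that the excerpt does not show.

```python
def rsplit_char(s, ch, maxcount=-1):
    """Right-to-left scan modelled on the stringlib_rsplit_char excerpt."""
    if maxcount < 0:
        # Assumption: a negative limit means "no limit", as with str.rsplit.
        maxcount = len(s)
    parts = []
    i = j = len(s) - 1  # i is the scan position, j marks the end of the current piece
    while i >= 0 and maxcount > 0:
        maxcount -= 1
        while i >= 0:
            if s[i] == ch:
                parts.append(s[i + 1:j + 1])  # plays the role of SPLIT_ADD(str, i + 1, j + 1)
                j = i = i - 1
                break
            i -= 1
    # Assumption: the unsplit remainder on the left is added last,
    # then the pieces are reversed so they read in their original order.
    parts.append(s[:j + 1])
    parts.reverse()
    return parts


print(rsplit_char('a|b|c|d', '|', 1))  # ['a|b|c', 'd']
```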
    