Efficiently split a string using multiple separators and retaining each separator?

野趣味 2021-02-02 10:44

I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need for the

9 Answers
  • 2021-02-02 11:27

    Try this:

    import re
    import string

    re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")
    

    Explanation from the Python documentation:

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
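    To make the effect of the capturing parentheses concrete, here is a minimal sketch (Python 3 assumed) that runs the same pattern with and without the group. The sample text and variable names are only illustrative.

```python
import re
import string

# Pattern matching a run of one or more punctuation or whitespace characters.
separators = "[" + re.escape(string.punctuation + string.whitespace) + "]+"

text = "Hello, world! Bye"  # illustrative sample, not from the question

# Without a capturing group the separators are discarded.
without_group = re.split(separators, text)

# With a capturing group each separator run is kept in the result.
with_group = re.split("(" + separators + ")", text)

print(without_group)  # ['Hello', 'world', 'Bye']
print(with_group)     # ['Hello', ', ', 'world', '! ', 'Bye']
```

    Because nothing is dropped in the second form, joining the pieces gives back the original string.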

  • 2021-02-02 11:28
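    # Python 3: the built-in zip is lazy, so it replaces itertools.izip
    from itertools import chain, cycle

    s = "Now is the winter of our discontent"
    words = s.split()  # splits on whitespace only and discards the original separators

    # Interleave every word with a single space
    wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))
    # result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']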
    
  • 2021-02-02 11:29
    import re
    import string
    
    p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
        string.punctuation + string.whitespace)))
    
    print(p.findall("Now is the winter of our discontent"))
    

    I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.

    I'll explain the regexp since you're not familiar with it:

    • [...] matches any one of the characters inside the square brackets
    • [^...] matches any one character not listed inside the square brackets
    • + after an item means one or more repetitions of that item
    • x|y matches either x or y

    So the regexp matches a run of one or more characters that are either all separators (punctuation or whitespace) or all non-separators. The findall method returns all non-overlapping matches of the pattern, in order.
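    As a rough illustration of how that pattern behaves on input that starts with separators, here is a small Python 3 sketch. The helper name tokenize and the sample string are made up for the example.

```python
import re
import string

SEPARATORS = re.escape(string.punctuation + string.whitespace)
# Either a run of non-separators or a run of separators.
TOKEN = re.compile("[^{0}]+|[{0}]+".format(SEPARATORS))


def tokenize(text):
    """Return alternating runs of non-separator and separator characters."""
    return TOKEN.findall(text)


sample = "  a, b...c"  # illustrative input with leading whitespace
pieces = tokenize(sample)
print(pieces)  # ['  ', 'a', ', ', 'b', '...', 'c']
```

    Unlike re.split with a capturing group, findall never produces empty strings, so a leading separator does not put '' at the front of the list.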

  • 2021-02-02 11:30

    A different non-regex approach from the others:

    >>> import string
    >>> from itertools import groupby
    >>> 
    >>> special = set(string.punctuation + string.whitespace)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>> 
    >>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
    >>> split_combined
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    >>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
    >>> split_separated
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
    

    Could use dict.fromkeys and .get instead of the lambda, I guess.

    [edit]

    Some explanation:

    groupby accepts two arguments: an iterable and an optional key function. It walks the iterable and groups consecutive items that share the same key function value:

    >>> groupby("sentence", lambda c: c in 'nt')
    <itertools.groupby object at 0x9805af4>
    >>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
    [(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
    

    where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
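    The sorting pitfall can be seen in a short sketch like the following (Python 3; the word list and the first_letter helper are invented for the illustration):

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]  # illustrative data


def first_letter(word):
    return word[0]


# Unsorted: 'a' and 'b' alternate, so every word lands in its own group.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorted by the same key first: equal keys become adjacent and merge.
sorted_groups = [
    (k, list(g)) for k, g in groupby(sorted(words, key=first_letter), first_letter)
]

print(unsorted_groups)
# [('a', ['apple']), ('b', ['bob']), ('a', ['avocado']), ('b', ['banana'])]
print(sorted_groups)
# [('a', ['apple', 'avocado']), ('b', ['bob', 'banana'])]
```

    For splitting a string, the contiguous behaviour is exactly what is wanted, since the order of the characters must be preserved.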

    As @JonClements guessed, what I had in mind was

    >>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>> [''.join(g) for k,g in groupby(s, special.get)]
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    

    for the case where we were combining the separators. .get returns None if the key isn't in the dict, so non-separator characters all share the key None.

  • 2021-02-02 11:31

    Solution in linear (O(n)) time:

    Let's say you have a string:

    original = "a, b...c    d"
    

    First convert all separators to space:

    import string

    splitters = string.punctuation + string.whitespace
    # str.maketrans is the Python 3 spelling; Python 2 used string.maketrans
    trans = str.maketrans(splitters, ' ' * len(splitters))
    s = original.translate(trans)
    

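    Now s == 'a  b   c    d': every separator character has become a space and the length is unchanged, so positions in s line up with positions in original. Now you can use itertools.groupby to alternate between runs of spaces and runs of non-spaces: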

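    import itertools

    result = []
    position = 0
    # Each group is a run of spaces or a run of non-spaces in s
    for _, letters in itertools.groupby(s, lambda c: c == ' '):
        letter_count = len(list(letters))
        # Slice the original string so the real separators are kept
        result.append(original[position:position + letter_count])
        position += letter_count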
    

    Now result == ['a', ', ', 'b', '...', 'c', '    ', 'd'], which is what you need.
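    Put together, the steps above might look like this as one function. This is a sketch assuming Python 3 (hence str.maketrans); the function name split_keep_separators is hypothetical.

```python
import itertools
import string

SPLITTERS = string.punctuation + string.whitespace
# Map every separator character to a single space (Python 3 str.maketrans).
TRANS = str.maketrans(SPLITTERS, " " * len(SPLITTERS))


def split_keep_separators(original):
    """Split into alternating word / separator runs, slicing the original text."""
    flattened = original.translate(TRANS)  # same length as original
    result = []
    position = 0
    for _key, run in itertools.groupby(flattened, lambda c: c == " "):
        run_length = sum(1 for _ in run)
        # Slice the untouched original so the real separators are kept.
        result.append(original[position:position + run_length])
        position += run_length
    return result


print(split_keep_separators("a, b...c    d"))
# ['a', ', ', 'b', '...', 'c', '    ', 'd']
```

    Each character is visited a constant number of times (once by translate, once by groupby), which is where the linear running time comes from.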

  • 2021-02-02 11:33

    My take:

    from string import whitespace, punctuation
    import re
    
    pattern = re.escape(whitespace + punctuation)
    print(re.split('([' + pattern + '])', 'now is the winter of'))
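
    Note that this pattern has no + after the character class, so each separator character becomes its own list item. A small Python 3 sketch of the difference, with an illustrative input where two separators are adjacent:

```python
import re
from string import punctuation, whitespace

char_class = "[" + re.escape(whitespace + punctuation) + "]"
text = "a, b"  # illustrative input: comma directly followed by a space

# One separator per list item; adjacent separators leave '' between them.
one_at_a_time = re.split("(" + char_class + ")", text)

# With '+', adjacent separators are captured together as one item.
as_runs = re.split("(" + char_class + "+)", text)

print(one_at_a_time)  # ['a', ',', '', ' ', 'b']
print(as_runs)        # ['a', ', ', 'b']
```

    If each separator is wanted individually, the empty strings can be dropped afterwards, for example with [p for p in one_at_a_time if p].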
    