Tokenize a string keeping delimiters in Python

无人共我 2021-02-05 10:32

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing som

5 answers
  • 2021-02-05 11:05

    The re module provides this functionality:

    >>> import re
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    

    (quoted from the Python documentation).

    For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

    The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

    Edit: As pointed out, any leading or trailing delimiters will also be added to the list (a leading delimiter is preceded by an empty string, and a trailing one is followed by an empty string). To avoid that, call .strip() on your input string first.
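
    To make the round trip concrete, here is a minimal sketch of the whitespace case. The helper name split_keep_whitespace is made up for illustration and is not part of re. It shows that joining the pieces gives back the original string, and what stripping first changes:

```python
import re

def split_keep_whitespace(text):
    """Split text on whitespace runs, keeping each run as its own item."""
    # The capturing group makes re.split return the delimiters too.
    return re.split(r'(\s+)', text)

original = '\tThis is an example'
parts = split_keep_whitespace(original)

# A leading delimiter yields an empty first item, so joining the parts
# still reproduces the input exactly.
assert ''.join(parts) == original

# Stripping first removes the leading/trailing whitespace from the result,
# at the cost of losing that whitespace from the reconstructed text.
stripped_parts = split_keep_whitespace('\tThis is an example '.strip())
```

    Note that stripping discards the outer whitespace, so if the layout has to be preserved exactly it may be simpler to keep the empty strings and just skip them while processing.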

  • 2021-02-05 11:10

    How about

    import re
    s = '\tThis is an example'  # sample input; substitute your own string
    splitter = re.compile(r'(\s+|\S+)')
    splitter.findall(s)
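
    As a rough sketch of how this differs from re.split: every match is either a run of whitespace or a run of non-whitespace, so the result never contains empty strings, even with leading or trailing whitespace. The names TOKEN_RE and tokenize are hypothetical, chosen only for this example:

```python
import re

# Each match is either a whitespace run or a non-whitespace run,
# so findall never produces empty strings.
TOKEN_RE = re.compile(r'\s+|\S+')

def tokenize(text):
    """Return alternating whitespace and non-whitespace runs of text."""
    return TOKEN_RE.findall(text)

text = '\tthis is an  example '
tokens = tokenize(text)

# The tokens cover the whole input, so the layout survives a round trip.
assert ''.join(tokens) == text
```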
    
  • 2021-02-05 11:16
    >>> import re
    >>> re.compile(r'(\s+)').split("\tthis is an  example")
    ['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
    
  • 2021-02-05 11:25

    Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

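    def split_keep_delimiters(s, delims="\t\n\r "):
        """Yield alternating runs of delimiter and non-delimiter characters."""
        if not s:
            return  # nothing to yield; also avoids IndexError on s[0]
        delim_group = s[0] in delims  # True while inside a delimiter run
        start = 0
        for index, char in enumerate(s):
            if delim_group != (char in delims):
                # Run type changed: emit the finished run and start a new one.
                delim_group = not delim_group
                yield s[start:index]
                start = index
        yield s[start:]  # final run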
    

    If I had time I'd benchmark them xD
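
    For anyone who does have the time, a benchmark could be sketched roughly like this with timeit. The sample input, the repeat count, and the names WS_SPLIT, WS_TOKENS and candidates are all assumptions picked for illustration; real numbers will depend on your data and Python version, so no timings are quoted here:

```python
import re
import timeit

WS_SPLIT = re.compile(r'(\s+)')
WS_TOKENS = re.compile(r'\s+|\S+')

def split_keep_delimiters(s, delims="\t\n\r "):
    # A copy of the hand-written generator, guarded against empty input.
    if not s:
        return
    in_delims = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if in_delims != (char in delims):
            in_delims = not in_delims
            yield s[start:index]
            start = index
    yield s[start:]

# Hypothetical test input: a short line repeated many times.
sample = "\tthis is an  example\n" * 1000

candidates = {
    "re.split": lambda: WS_SPLIT.split(sample),
    "re.findall": lambda: WS_TOKENS.findall(sample),
    "generator": lambda: list(split_keep_delimiters(sample)),
}

# Time each approach over the same input and print fastest first.
timings = {name: timeit.timeit(fn, number=20) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:12s} {seconds:.4f} s")
```

    One caveat when comparing: \s matches more characters than the default delims string (for example form feed and vertical tab), so the three approaches only agree on inputs that stick to tabs, newlines, carriage returns and spaces.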

  • 2021-02-05 11:29

    Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

    >>> from pyparsing import Word, alphas
    >>> greet = Word(alphas) + "," + Word(alphas) + "!"
    >>> hello1 = 'Hello, World!'
    >>> hello2 = 'Greetings, Earthlings!'
    >>> for hello in hello1, hello2:
    ...     print('%s \u2192 %r' % (hello, greet.parseString(hello).asList()))
    ... 
    Hello, World! → ['Hello', ',', 'World', '!']
    Greetings, Earthlings! → ['Greetings', ',', 'Earthlings', '!']
    