tokenize a string keeping delimiters in Python

How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
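
For instance, with the sample string used elsewhere in this thread (s stands in for whatever string you want to tokenize), findall returns the whitespace and non-whitespace runs in order:

>>> splitter.findall("\tthis is an  example")
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']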

A capturing re.split gives a similar result, except that it also yields an empty string when the input starts with a delimiter:

>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

Edit: As pointed out, any leading/trailing delimiters will of course also show up in the list, together with an empty string at that end. To avoid that, you can use the .strip() method on your input string first.
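
For instance, stripping the leading tab from the example above removes both the empty string and the tab token:

>>> re.split(r'(\s+)', '\tThis is an example'.strip())
['This', ' ', 'is', ' ', 'an', ' ', 'example']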

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print('%s \u2192 %r' % (hello, greet.parseString(hello)))
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
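
If you want the tokens as a plain Python list rather than a ParseResults object, the asList() method converts the result:

>>> greet.parseString(hello1).asList()
['Hello', ',', 'World', '!']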

Thanks guys for pointing out the re module, I'm still trying to decide between that and using my own function that returns a sequence...

def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:  # guard the empty string, which would make s[0] fail below
        return
    delim_group = s[0] in delims  # are we currently in a run of delimiters?
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # the run type flipped, so emit the run that just ended
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # emit the final run
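
For example, on the same input as the re-based answers above:

>>> list(split_keep_delimiters("\tthis is an  example"))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']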

If I had time I'd benchmark them xD
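
A minimal benchmarking sketch using the standard timeit module, assuming split_keep_delimiters from above is in scope (the test string and iteration count are arbitrary choices):

import re
import timeit

TEXT = "\tthis is an  example " * 100  # arbitrary test input

# time the three approaches from this thread
print(timeit.timeit(lambda: re.findall(r'(\s+|\S+)', TEXT), number=1000))
print(timeit.timeit(lambda: re.split(r'(\s+)', TEXT), number=1000))
print(timeit.timeit(lambda: list(split_keep_delimiters(TEXT)), number=1000))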
