In Python, how do I split a string and keep the separators?

[愿得一人] 2020-11-22 03:26

Here's the simplest way to explain this. Here's what I'm using:

re.split(r'\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']

The separators are dropped from the result, but I want to keep them.


        
13 Answers
  • 2020-11-22 03:48

    You can also split a string with an array of strings instead of a regular expression, like this:

    def tokenizeString(aString, separators):
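        # separators is a list of strings used to split the string.
        # Sort by ascending length: the loop below keeps the last match,
        # so the longest matching separator wins. sorted() leaves the caller's list unchanged.
        separators = sorted(separators, key=len)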
        listToReturn = []
        i = 0
        while i < len(aString):
            theSeparator = ""
            for current in separators:
                if current == aString[i:i+len(current)]:
                    theSeparator = current
            if theSeparator != "":
                listToReturn += [theSeparator]
                i = i + len(theSeparator)
            else:
                if listToReturn == []:
                    listToReturn = [""]
                if(listToReturn[-1] in separators):
                    listToReturn += [""]
                listToReturn[-1] += aString[i]
                i += 1
        return listToReturn
        
    
    print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))
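The same idea of splitting on a list of literal strings can also be expressed with `re.split`, by escaping each separator and joining them into one alternation. This is only a minimal sketch; the function name `split_keep_separators` and the sample input are made up for illustration and are not part of the answer above.

```python
import re

def split_keep_separators(text, separators):
    """Split text on any of the literal separators, keeping them in the result."""
    # Longest first, so '+=' is tried before '+' at the same position.
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape makes characters such as '+', '*', '(' match literally.
    pattern = "(" + "|".join(re.escape(sep) for sep in ordered) + ")"
    # The capturing group makes re.split return the separators too;
    # drop the empty strings that appear between adjacent separators.
    return [piece for piece in re.split(pattern, text) if piece]

tokens = split_keep_separators("a += b+c", ["+=", "+", " "])
print(tokens)  # ['a', ' ', '+=', ' ', 'b', '+', 'c']
```

Unlike the loop above, this sketch assumes the separator list is non-empty.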
    
  • 2020-11-22 03:54

    Another example: split on non-alphanumeric characters and keep the separators.

    import re
    a = "foo,bar@candy*ice%cream"
    re.split('([^a-zA-Z0-9])',a)
    

    output:

    ['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']
    

    explanation

    re.split('([^a-zA-Z0-9])',a)
    
    () <- capturing group: keeps the separators in the result
    [] <- character class: matches any one character from the set inside
    ^a-zA-Z0-9 <- negated set: anything except ASCII letters (upper/lower case) and digits
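To see what the parentheses change, the sketch below runs the same pattern with and without the capturing group on the input from this answer. The variable names `without_group` and `with_group` are chosen here for illustration.

```python
import re

a = "foo,bar@candy*ice%cream"

# Without a capturing group, the separators are discarded.
without_group = re.split(r"[^a-zA-Z0-9]", a)

# With a capturing group, each matched separator is returned as its own item.
with_group = re.split(r"([^a-zA-Z0-9])", a)

print(without_group)  # ['foo', 'bar', 'candy', 'ice', 'cream']
print(with_group)     # ['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']
```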
    
  • 2020-11-22 03:55

    One Lazy and Simple Solution

    Assume your regex pattern is split_pattern = r'(!|\?)'

    First, append a marker string that does not otherwise occur in the text, such as '[cut]', after every separator:

    new_string = re.sub(split_pattern, '\\1[cut]', your_string)

    Then split on the marker: new_string.split('[cut]'). Each separator stays attached to the end of the piece before it.
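Put together, the two steps might look like the sketch below. The sample sentence in `your_string` is an assumption made for illustration, and the approach only works if the marker never appears in the input.

```python
import re

split_pattern = r"(!|\?)"
your_string = "Hi! How are you? Fine"  # sample input, assumed for illustration
marker = "[cut]"                       # must not occur in your_string

# \1 re-inserts the matched separator, followed by the marker.
new_string = re.sub(split_pattern, r"\1" + marker, your_string)
pieces = new_string.split(marker)

print(pieces)  # ['Hi!', ' How are you?', ' Fine']
```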

  • 2020-11-22 03:57

    To split a string and keep the separators when the regex has no capturing group:

    import re

    def finditer_with_separators(regex, s):
        matches = []
        prev_end = 0
        for match in regex.finditer(s):
            # Keep the text between the previous separator and this one,
            # skipping empty pieces.
            if match.start() > prev_end:
                matches.append(s[prev_end:match.start()])
            matches.append(match.group())
            prev_end = match.end()
        # Keep any text after the last separator.
        if prev_end < len(s):
            matches.append(s[prev_end:])
        return matches
    
    regex = re.compile(r"[\(\)]")
    s = "f(a)(b)"  # example input
    matches = finditer_with_separators(regex, s)  # ['f', '(', 'a', ')', '(', 'b', ')']
    

    If the regex is already wrapped in a capturing group:

    def split_with_separators(regex, s):
        matches = list(filter(None, regex.split(s)))
        return matches
    
    regex = re.compile(r"([\(\)])")
    matches = split_with_separators(regex, s)
    

    Both approaches also drop the empty strings that a plain split would return, which are unwanted in most cases.
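The empty strings in question show up whenever two separators are adjacent or a separator sits at either end of the input. A small sketch, with a sample string assumed for illustration:

```python
import re

regex = re.compile(r"([()])")
s = "(a)(b)"  # sample input, assumed for illustration

raw = regex.split(s)               # plain split with a capturing group
cleaned = list(filter(None, raw))  # same filtering as split_with_separators

print(raw)      # ['', '(', 'a', ')', '', '(', 'b', ')', '']
print(cleaned)  # ['(', 'a', ')', '(', 'b', ')']
```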

  • 2020-11-22 04:00
    1. Replace every separator (\W) with the separator followed by a new separator (\W;).

    2. Split on the new separator (;).

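    import re

    def split_and_keep(separator, s):
      # Append ';' after every match, then split on ';'.
      # Assumes ';' does not already occur in s.
      return re.split(';', re.sub(separator, lambda match: match.group() + ';', s))
    
    print(split_and_keep(r'\W', 'foo/bar spam\neggs'))  # ['foo/', 'bar ', 'spam\n', 'eggs']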
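If the input may already contain ';', one possible alternative is to split on a zero-width lookbehind, which needs no helper character at all. This is a sketch under two assumptions: Python 3.7 or later (earlier versions do not split on empty matches), and a separator pattern of fixed width, since `re` lookbehinds require that. The name `split_after` is made up for illustration.

```python
import re

def split_after(separator, s):
    # (?<=...) matches the empty position right after a separator,
    # so each separator stays attached to the piece before it.
    return re.split("(?<=" + separator + ")", s)

result = split_after(r"\W", "foo/bar spam\neggs")
print(result)  # ['foo/', 'bar ', 'spam\n', 'eggs']
```

Note that an input ending in a separator yields a trailing empty string, which can be filtered out if unwanted.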
    
  • 2020-11-22 04:02
    >>> re.split(r'(\W)', 'foo/bar spam\neggs')
    ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
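Because the capturing group keeps every separator, the pieces alternate between text and separator, and joining them gives back the original string. A short sketch of that property, with variable names chosen here for illustration:

```python
import re

text = "foo/bar spam\neggs"
parts = re.split(r"(\W)", text)

words = parts[::2]        # even positions hold the text between separators
separators = parts[1::2]  # odd positions hold the captured separators

print(words)                   # ['foo', 'bar', 'spam', 'eggs']
print(separators)              # ['/', ' ', '\n']
print("".join(parts) == text)  # True
```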
    