Python regex: splitting on pattern match that is an empty string

情书的邮戳 2020-11-28 14:34

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?

        
2 Answers
  • 2020-11-28 15:05
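    import regex  # third-party module (pip install regex), not the standard-library re

    x = "bazbarbarfoobar"
    # VERSION1 behaviour lets split() act on zero-width matches
    print(regex.split(r"(?<!baz)(?=bar)", x, flags=regex.VERSION1))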
    

    You can use the regex module for this.

    Alternatively, match the pieces instead of splitting on the gaps between them:

    (.+?(?<!foo))(?=bar|$)|(.+?foo)$
    

    Use it with re.findall().
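Here is a minimal sketch of how that pattern might be used. The input string is borrowed from the other answer as an assumption, and the name `pieces` is only illustrative. Because the pattern has two capture groups, re.findall() returns one tuple per match, so the sketch picks whichever group actually matched.

```python
import re

pattern = r'(.+?(?<!foo))(?=bar|$)|(.+?foo)$'
text = 'foobarbarbazbar'  # assumed sample input, taken from the other answer

# With two groups, findall() yields (group1, group2) tuples; the group that
# did not take part in the match is returned as an empty string.
pieces = [first or second for first, second in re.findall(pattern, text)]

print(pieces)  # ['foobar', 'barbaz', 'bar']
```

The pieces concatenate back to the original string, which is a cheap sanity check when adapting the pattern.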

  • 2020-11-28 15:19

    Unfortunately, split() requires a non-zero-width match. This has not been fixed yet, because quite a lot of incorrect code depends on the current behaviour, for example by using [something]* as the regex. Such patterns now generate a FutureWarning, and patterns that can never split anything raise a ValueError from Python 3.5 onwards:

    >>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.6/re.py", line 212, in split
        return _compile(pattern, flags).split(string, maxsplit)
    ValueError: split() requires a non-empty pattern match.
    

    The idea is that after a period of warnings, the behaviour can be changed so that your regular expression works again. That change was made in Python 3.7, where re.split() does split on empty matches.
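To see which behaviour a given interpreter has, one option is to call re.split() and catch the error. This is only a sketch: the helper name `try_builtin_split` is made up for illustration, and it assumes the ValueError shown above is the only failure mode of interest.

```python
import re
import sys

def try_builtin_split(pattern, string):
    """Return the re.split() result, or None if empty matches are rejected."""
    try:
        return re.split(pattern, string)
    except ValueError:
        # Raised on Python 3.5 and 3.6 for patterns that only match the empty string
        return None

result = try_builtin_split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')

# On Python 3.7+ this is expected to print ['foobar', 'barbaz', 'bar'].
# Interpreters older than 3.5 return the string unsplit instead of raising.
print(sys.version_info[:2], result)
```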


    If you can't use the regex module, you can write your own split function using re.finditer():

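    import re

    def megasplit(pattern, string):
        # (start, end) span of every match, including zero-width ones
        splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
        # each piece starts where the previous match ended...
        starts = [0] + [end for _, end in splits]
        # ...and ends where the next match begins
        ends = [start for start, _ in splits] + [len(string)]
        return [string[start:end] for start, end in zip(starts, ends)]

    print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
    print(megasplit(r'o', 'foobarbarbazbar'))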
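As a rough check of megasplit(), the sketch below compares it with re.split() on a delimiter that is not zero-width, and shows one difference worth knowing about: re.split() includes captured groups in its result, while this hand-written version ignores them. The variable names are illustrative only.

```python
import re

def megasplit(pattern, string):
    splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    starts = [0] + [end for _, end in splits]
    ends = [start for start, _ in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

text = 'foobarbarbazbar'

# Zero-width pattern: split before each "bar" that is not preceded by "foo"
zero_width = megasplit(r'(?<!foo)(?=bar)', text)
print(zero_width)  # ['foobar', 'barbaz', 'bar']

# Ordinary delimiter: the result matches what re.split() gives
plain = megasplit(r'o', text)
print(plain == re.split(r'o', text))  # True

# Capture groups: re.split() keeps the delimiters, megasplit() drops them
print(re.split(r'(o)', 'foobar'))   # ['f', 'o', '', 'o', 'bar']
print(megasplit(r'(o)', 'foobar'))  # ['f', '', 'bar']
```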
    

    If you are sure that the matches are zero-width only, you can use just the start positions of the matches, which simplifies the code:

    import re

    def zerowidthsplit(pattern, string):
        # positions of the zero-width matches
        splits = [m.start() for m in re.finditer(pattern, string)]
        starts = [0] + splits
        ends = splits + [len(string)]
        return [string[start:end] for start, end in zip(starts, ends)]

    print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
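The "zero-width only" condition matters. The sketch below, with illustrative variable names, shows what happens if zerowidthsplit() is given a pattern that consumes characters: since only the start of each match is used, the matched text is never dropped and ends up at the front of the following piece.

```python
import re

def zerowidthsplit(pattern, string):
    splits = [m.start() for m in re.finditer(pattern, string)]
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

text = 'foobarbarbazbar'

# Zero-width pattern: behaves as intended
good = zerowidthsplit(r'(?<!foo)(?=bar)', text)
print(good)  # ['foobar', 'barbaz', 'bar']

# Non-zero-width pattern: each "o" stays in the output instead of being removed
leaky = zerowidthsplit(r'o', text)
print(leaky)  # ['f', 'o', 'obarbarbazbar']
```

In both cases the pieces join back into the original string, so the function never loses text; it just does not remove delimiters.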
    