Efficiently split a string using multiple separators and retaining each separator?

野趣味 2021-02-02 10:44

I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need for the

9 Answers
  • 2021-02-02 11:34

    For any arbitrary collection of separators:

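    def separate(myStr, seps):
        """Split myStr on every character in seps, keeping each separator as its own item."""
        answer = []
        temp = []  # characters of the piece currently being built
        for char in myStr:
            if char in seps:
                # Close the current piece (possibly empty), then record the separator.
                answer.append(''.join(temp))
                answer.append(char)
                temp = []
            else:
                temp.append(char)
        answer.append(''.join(temp))  # trailing piece after the last separator
        return answer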
    
    In [4]: print(separate("Now is the winter of our discontent", set(' ')))
    ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
    
    In [5]: print(separate("Now, really - it is the winter of our discontent", set(' ,-')))
    ['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
    

    Hope this helps
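
    The empty strings in the second output appear wherever two separators sit next to each other. If those are unwanted, the same loop can skip them. The following is a minimal sketch, assuming the separators are the ones named in the question (string.punctuation and string.whitespace); the name split_keep is made up for this example and is not part of the answer above:

```python
import string

# Hypothetical helper: default separators are taken from the question.
def split_keep(text, seps=frozenset(string.punctuation + string.whitespace)):
    """Split text on every character in seps, keeping each separator
    as its own item and skipping the empty strings between neighbours."""
    pieces = []
    word = []
    for char in text:
        if char in seps:
            if word:  # flush the pending word, if there is one
                pieces.append(''.join(word))
                word = []
            pieces.append(char)
        else:
            word.append(char)
    if word:  # trailing word after the last separator
        pieces.append(''.join(word))
    return pieces

print(split_keep("Now, really - it is the winter"))
# ['Now', ',', ' ', 'really', ' ', '-', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter']
```

    Because every character ends up in exactly one item, ''.join(split_keep(text)) gives back the original string.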

  • 2021-02-02 11:36

    Depending on the text you are dealing with, you may be able to simplify your concept of delimiters to "anything other than letters and numbers". If that works for your data, you can use the following regex solution:

    import re

    re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)
    

    This assumes that you want to split on each individual delimiter character even when several occur consecutively, so 'foo..bar' would become ['foo', '.', '.', 'bar']. If instead you expect ['foo', '..', 'bar'], use [a-zA-Z\d]+|[^a-zA-Z\d]+ (the only difference is the + added at the very end).
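
    The two patterns can be compared side by side on one input. The third pattern below is an assumption and not part of this answer: it builds the delimiter class from string.punctuation and string.whitespace with re.escape, for cases where "anything other than letters and numbers" is too broad. A rough sketch:

```python
import re
import string

text = "foo..bar, baz"

# One item per delimiter character.
single = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

# Runs of consecutive delimiters grouped into a single item.
grouped = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]+', text)

# Assumption: the delimiters are exactly punctuation + whitespace.
# re.escape makes characters such as ']' and '\' safe inside [...].
delims = re.escape(string.punctuation + string.whitespace)
exact = re.findall(r'[^%s]+|[%s]' % (delims, delims), text)

print(single)   # ['foo', '.', '.', 'bar', ',', ' ', 'baz']
print(grouped)  # ['foo', '..', 'bar', ', ', 'baz']
print(exact)    # ['foo', '.', '.', 'bar', ',', ' ', 'baz']
```

    The first and third patterns agree on plain ASCII text but differ outside it: an accented letter such as 'é' counts as a delimiter for the first pattern, while the third keeps it inside the word.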

  • 2021-02-02 11:42
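    from functools import reduce  # reduce is not a builtin in Python 3
    from string import punctuation

    s = "..test. and stuff"

    # Pad each punctuation character with spaces so that split() isolates it.
    f = lambda acc, c: acc + ' ' + c + ' ' if c in punctuation else acc + c
    # The '' initial value ensures the first character of each word also goes through f.
    # Note: whitespace is consumed by split() here and is not kept in the result.
    l = sum([reduce(f, word, '').split() for word in s.split()], [])

    print(l)  # ['.', '.', 'test', '.', 'and', 'stuff']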
    