Split Strings into words with multiple word boundary delimiters

既然无缘 2020-11-21 05:09

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

"H

30 answers
  •  北荒 (OP)
    2020-11-21 06:07

    I like re, but here is my solution without it:

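    from itertools import groupby

    sep = ' ,-!?'  # every character treated as a delimiter
    s = "Hey, you - what are you doing here!?"
    # Keep only the groups whose key is False, i.e. runs of non-delimiter characters.
    print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
    # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']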
    

    sep.__contains__ is the method that the 'in' operator calls. Basically it is the same as

    lambda ch: ch in sep
    

    but is more convenient here.
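
    To check that equivalence, here is a small sketch. The names is_sep and is_sep_lambda are only introduced for illustration; they are not part of the solution above.

```python
sep = ' ,-!?'

# The bound method that backs the expression `ch in sep`.
is_sep = sep.__contains__
# The explicit lambda spelling of the same test.
is_sep_lambda = lambda ch: ch in sep

s = "Hey, you - what are you doing here!?"

# Compare both predicates on every character of the sample string.
same = all(is_sep(ch) == is_sep_lambda(ch) for ch in s)
print(same)
```

    Both predicates should agree on every character, so either one can be passed to groupby as the key function.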

    groupby takes our string and a key function. It splits the string into groups using that function: whenever the function's value changes, a new group starts. So sep.__contains__ is exactly what we need.

    groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is the group. With 'if not k' we filter out the groups of separators (because sep.__contains__ returns True on separators). That's all: we now have a sequence of groups where each one is a word (a group is actually an iterator, so we use join to convert it to a string).
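
    To make the pairs concrete, the sketch below prints every (key, group) pair for a shortened version of the sample string before any filtering. The variable name pairs is just for illustration.

```python
from itertools import groupby

sep = ' ,-!?'
s = "Hey, you - what"

# Join each group right away so the (key, text) pairs can be inspected.
pairs = [(k, ''.join(g)) for k, g in groupby(s, sep.__contains__)]

for k, text in pairs:
    print(k, repr(text))
```

    Groups with key True are runs of separators and groups with key False are the words, which is why 'if not k' keeps only the words.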

    This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). It also doesn't need intermediate strings or lists: if you remove join and use a generator expression instead of the list comprehension, the whole expression becomes lazy, since each group is an iterator. Just consume each group before moving on to the next one.
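
    As a rough sketch of that generality, the same idea can be wrapped in a generator that accepts any predicate. The helper name split_by and its parameters are hypothetical, not something from the solution above, and the predicate used here (anything that is not alphanumeric counts as a separator) is only one possible choice.

```python
from itertools import groupby

def split_by(text, is_separator):
    """Lazily yield the runs of characters for which is_separator is false."""
    for is_sep, group in groupby(text, is_separator):
        if not is_sep:
            # Join here, before groupby advances to the next group.
            yield ''.join(group)

s = "Hey, you - what are you doing here!?"

# Treat every non-alphanumeric character as a separator.
words = split_by(s, lambda ch: not ch.isalnum())

print(next(words))   # only the first word has been computed so far
print(list(words))   # the remaining words
```

    Because split_by is a generator, nothing is processed until a word is requested, so it should also work on any iterable of characters, not only on a str.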
