RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)


Question


I almost found the answer to this question in this thread (samplebias's answer); however, I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need it to preserve the order in which each of these things occurs (which the code in that thread already does).

So, what I've found is something like this:

    >>> from nltk.tokenize import regexp_tokenize
    >>> txt = "Today it's \t07.May 2011. Or 2.999."
    >>> regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
    ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']

But this is the kind of list I need it to yield:

    ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']

Regex has always been one of my weak points, so after a couple of hours of research I'm still stumped. Thank you!!


Answer 1:


I think that something like this should work for you. There is probably more in that regex than there needs to be, but your requirements are somewhat vague and don't exactly match up with the expected output you provided.

>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
>>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
>>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']



Answer 2:


In the regex \w+([.,]\w+)*|\S+, the branch \w+([.,]\w+)* captures words (including internal dots and commas, as in 07.May) and \S+ captures any other run of non-whitespace characters.

To capture single spaces and tabs as well, add a third branch: \w+([.,]\w+)*|\S+|[ \t].
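A minimal sketch of that suggestion using re.findall directly (two assumptions: the capturing group is rewritten as non-capturing (?:...), since findall would otherwise return only the group's contents rather than the full match, and the test string borrows the explicit tab from the first answer):

>>> import re
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> # (?:...) so findall returns whole matches; [ \t] catches one space/tab at a time
>>> re.findall(r"\w+(?:[.,]\w+)*|\S+|[ \t]", txt)
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']

This is exactly the list the question asks for.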




Answer 3:


This isn't fully compliant with the expected output you provided (the standalone periods get dropped), and some more details in the question would help, but anyway:

>>> txt = "Today it's   07.May 2011. Or 2.999."
>>> regexp_tokenize(txt, pattern=r"\w+([.',]\w+)*|[ \t]+")
['Today', ' ', "it's", ' \t', '07.May', ' ', '2011', ' ', 'Or', ' ', '2.999']


Source: https://stackoverflow.com/questions/6987356/regex-tokenizer-split-text-into-words-digits-punctuation-and-spacing-do-not
