Modify python nltk.word_tokenize to exclude “#” as delimiter


Question


I am using Python's NLTK library to tokenize my sentences.

If my code is

text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ,, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + is not a delimiter and C++ therefore appears as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.


Answer 1:


Another idea: instead of altering how the text is tokenized, loop over the tokens and join every '#' with the token that precedes it.

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i-1]
        joined = [tokens[i - 1] + t]
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
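
The same merge can also be written as a single pass that builds a new list, avoiding the index bookkeeping. A minimal sketch (the helper name merge_hash is mine, not from the original answer):

def merge_hash(tokens):
    # Append each '#' token onto the token that precedes it.
    merged = []
    for tok in tokens:
        if tok == '#' and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

>>> merge_hash(word_tokenize("C# billion dollars; we don't own an ounce C++"))
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']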



Answer 2:


Since this is a multi-word tokenization problem, another approach is to retokenize the extracted tokens with NLTK's multi-word expression (MWE) tokenizer:

import nltk
tokens = nltk.word_tokenize("C# billion dollars; we don't own an ounce C++")
# MWEs are matched case-sensitively, so use ('C', '#'), not ('c', '#')
mwtokenizer = nltk.MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))
mwtokenizer.tokenize(tokens)
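
With the uppercase MWE registered, the retokenized list should come out as follows (output reconstructed by hand; the original answer did not show it):

>>> mwtokenizer.tokenize(tokens)
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']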



Answer 3:


NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.

Here is an example where the text is split on any whitespace character (tab, newline, etc.) and, just for illustration, a couple of other symbols:

>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
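
Note that the pattern above also splits on the apostrophe, which is why don't comes out as don and t. Dropping ' from the character class keeps contractions intact (the ; is still discarded, since gaps=True removes the delimiters themselves). A sketch of that tradeoff, not part of the original answer:

>>> regexp_tokenize(txt, pattern=r"\s|[\.,;]", gaps=True)
['C#', 'billion', 'dollars', 'we', "don't", 'own', 'an', 'ounce', 'C++']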


Source: https://stackoverflow.com/questions/35674103/modify-python-nltk-word-tokenize-to-exclude-as-delimiter
