Modify Python nltk.word_tokenize to exclude “#” as a delimiter

Question

I am using Python's NLTK library to tokenize my sentences.

If my code is

import nltk

text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, . and # are treated as delimiters. Is there a way to remove # from the set of delimiters, in the same way that + isn't a delimiter, so that C++ appears as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.

Answer 1:

Another idea: instead of altering how the text is tokenized, just loop over the tokens and join every '#' with the token that precedes it.

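from nltk.tokenize import word_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

# enumerate() keeps walking the original list, while `tokens` is rebound to a
# shorter list after each merge; i_offset maps old indices onto the new list.
i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i-1]
        joined = [tokens[i - 1] + t]  # glue '#' onto the previous token
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1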

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
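
The same merge can also be written as a single pass that builds a new list, which avoids the index bookkeeping. This is only a minimal sketch: the helper name merge_with_previous and its symbol parameter are made up for illustration, and the token list is hard-coded from the output shown in the question so the snippet runs without NLTK.

```python
def merge_with_previous(tokens, symbol="#"):
    """Return a new list where each `symbol` token is glued onto the token before it."""
    merged = []
    for token in tokens:
        if token == symbol and merged:
            # Attach the symbol to the most recently kept token.
            merged[-1] += token
        else:
            # Ordinary token, or a symbol with nothing before it: keep as-is.
            merged.append(token)
    return merged


# Token list copied from the word_tokenize output in the question.
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
print(merge_with_previous(tokens))
# ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
```

Like the loop above, this joins every standalone '#' to whatever comes before it, so a '#' that is not part of a language name (for example in "issue # 5") would be merged as well.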

Answer 2:

Since this is really a multi-word tokenization problem, another option is to re-tokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:

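from nltk.tokenize import MWETokenizer

mwtokenizer = MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))  # matching is case-sensitive: 'C', not 'c'
mwtokenizer.tokenize(tokens)     # tokens is the list returned by word_tokenize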

Answer 3:

NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.

Here is an example where the text is split on any whitespace character (space, tab, newline, etc.) and on a few other symbols, just as an illustration:

>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
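
Note that splitting on gaps drops the ';' and breaks "don't" into 'don' and 't'. If punctuation should survive as separate tokens, one alternative is to describe the tokens themselves rather than the gaps between them. The sketch below uses only the standard re module. The pattern and the name simple_tokenize are assumptions made up for this example, not NLTK's own rules, and it keeps "don't" whole instead of splitting it into 'do' and "n't" as word_tokenize does.

```python
import re

# Illustrative pattern (an assumption, not what NLTK uses). Alternatives are
# tried left to right:
#   1. a word, optionally with an apostrophe part ("don't"),
#      optionally followed by any run of '#' or '+' ("C#", "C++")
#   2. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?[#+]*|[^\w\s]")


def simple_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


txt = "C# billion dollars; we don't own an ounce C++"
print(simple_tokenize(txt))
# ['C#', 'billion', 'dollars', ';', 'we', "don't", 'own', 'an', 'ounce', 'C++']
```

Because the pattern contains no capturing groups, it should also work when passed to regexp_tokenize with gaps left at its default of False, though that has not been checked here.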

Source: https://stackoverflow.com/questions/35674103/modify-python-nltk-word-tokenize-to-exclude-as-delimiter

Tags

python

nltk

tokenize