How do I use NLTK's default tokenizer to get spans instead of strings?

北城以北 submitted on 2020-01-02 00:56:09

Question


NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on each sentence. It does a pretty good job out of the box.

>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']

I'd like to use this same algorithm except to have it return tuples of offsets into the original string instead of string tokens.

By offset I mean 2-tuples that can serve as indexes into the original string. For example, here I'd have

>>> s = "(Dr. Edwards is my friend.)"
>>> s.token_spans()
[(0,1), (1,4), (5,12), (13,15), (16,18), (19,25), (25,26), (26,27)]

because s[0:1] is "(", s[1:4] is "Dr." and so forth.

Is there a single NLTK call that does this, or do I have to write my own offset arithmetic?


Answer 1:


At least since NLTK 3.4, TreebankWordTokenizer supports span_tokenize:

>>> from nltk.tokenize import TreebankWordTokenizer as twt
>>> list(twt().span_tokenize('What is the airspeed of an unladen swallow ?'))
[(0, 4),
 (5, 7),
 (8, 11),
 (12, 20),
 (21, 23),
 (24, 26),
 (27, 34),
 (35, 42),
 (43, 44)]
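
As a quick sanity check, the spans can be sliced back out of the original string to recover the tokens. The sketch below does not call NLTK; it hard-codes the spans copied from the output above, so it only illustrates how the (start, end) pairs are meant to be used.

```python
# The sentence and the spans are copied from the span_tokenize output above.
text = 'What is the airspeed of an unladen swallow ?'
spans = [(0, 4), (5, 7), (8, 11), (12, 20), (21, 23),
         (24, 26), (27, 34), (35, 42), (43, 44)]

# Each span is a half-open [start, end) slice into the original string.
tokens = [text[start:end] for start, end in spans]
print(tokens)
# ['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
```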



Answer 2:


Yes, most tokenizers in NLTK have a method called span_tokenize, but unfortunately the tokenizer you are using doesn't.

By default, the word_tokenize function uses a TreebankWordTokenizer. The TreebankWordTokenizer is fairly robust, but it currently lacks one important method: span_tokenize.

Since I see no implementation of span_tokenize for TreebankWordTokenizer, I believe you will need to implement your own. Subclassing TokenizerI can make this a little less complex.

You might find the span_tokenize method of PunktWordTokenizer useful as a starting point.
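
If you do end up writing the offset arithmetic yourself, one possible approach is to tokenize as usual and then locate each token in the original string, scanning left to right. The following is only a minimal sketch: spans_from_tokens is a hypothetical helper, not part of NLTK, and it assumes every token appears verbatim in the text. That assumption fails when the tokenizer rewrites characters (the Treebank tokenizer, for example, converts double quotes to `` and ''), in which case the sketch raises ValueError.

```python
def spans_from_tokens(text, tokens):
    """Yield (start, end) offsets for each token, in order of appearance.

    Hypothetical helper (not an NLTK function). Assumes each token occurs
    verbatim in `text` at or after the end of the previous token.
    """
    position = 0
    for token in tokens:
        start = text.find(token, position)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {position}")
        end = start + len(token)
        yield (start, end)
        position = end  # continue searching after this token


# Tokens copied from the word_tokenize output shown in the question.
text = "(Dr. Edwards is my friend.)"
tokens = ['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']

spans = list(spans_from_tokens(text, tokens))
print(spans)
# [(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]
```

Searching from the end of the previous token, instead of from the start of the string, keeps repeated tokens (such as a second '.') mapped to the correct occurrence.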

I hope this info helps.



Source: https://stackoverflow.com/questions/28678318/how-do-i-use-nltks-default-tokenizer-to-get-spans-instead-of-strings
