NLTK Sentence Tokenizer, custom sentence starters

拈花ヽ惹草 submitted on 2021-02-08 05:29:23

Question


I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains lists introduced by bullet points, but the list items are not recognized as new sentences. I tried to add some parameters, but that didn't work. Is there another way?

Here is some example code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)

tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']

Answer 1:


You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars = BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")

This will result in the following output:

['•', 'I am a sentence •', 'I am another sentence']

However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:

I introduce a list of sentences.

  • I am sentence one
  • I am sentence two

And I am one, too!

Would, depending on the details of your text, result in something like the following:

>>> tokenizer.tokenize("""
Look at these sentences:

• I am sentence one
• I am sentence two

But I am one, too!
""")

['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
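If the stray bullet tokens in that output are acceptable as an intermediate result, a small post-processing step can clean them up. The following is a minimal sketch (the helper name `clean_sentences` is my own, not part of nltk); it drops items that consist only of a bullet and strips leading/trailing bullets and whitespace from the rest, while internal whitespace is left untouched:

```python
def clean_sentences(sentences):
    """Remove stray bullet markers from Punkt tokenizer output."""
    cleaned = []
    for s in sentences:
        # Strip surrounding whitespace, then any bullet chars, then whitespace again.
        s = s.strip().strip('•').strip()
        if s:  # skip items that were nothing but a bullet
            cleaned.append(s)
    return cleaned

raw = ['\nLook at these sentences:\n\n•',
       'I am sentence one\n•',
       'I am sentence two\n\nBut I am one, too!\n']
print(clean_sentences(raw))
# → ['Look at these sentences:', 'I am sentence one',
#    'I am sentence two\n\nBut I am one, too!']
```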

One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of simply employing something like a multi-delimiter split function, is that it can learn to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.

There should, however, be no such complications for •, so I would advise you to write a simple parser that preprocesses the bullet-point formatting instead of abusing PunktSentenceTokenizer for something it is not designed for. How this might be achieved in detail depends on how exactly this kind of markup is used in the text.
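Such a preprocessor could be as simple as a regex split on the bullet marker. The sketch below (the helper name `split_bullet_items` is my own invention) cuts the text into segments at each •; each resulting segment could then be fed to PunktSentenceTokenizer separately to split any remaining prose into sentences:

```python
import re

def split_bullet_items(text):
    """Split text at bullet markers, discarding empty segments."""
    segments = []
    for part in re.split(r'\s*•\s*', text):
        part = part.strip()
        if part:
            segments.append(part)
    return segments

print(split_bullet_items('• I am a sentence • I am another sentence'))
# → ['I am a sentence', 'I am another sentence']
```

Note that a segment may still contain more than one sentence (e.g. a bullet item followed by a trailing paragraph), which is exactly where running Punkt on each segment afterwards would pick up the remaining sentence boundaries.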



Source: https://stackoverflow.com/questions/29746635/nltk-sentence-tokenizer-custom-sentence-starters
