NLTK Sentence Tokenizer, custom sentence starters

拈花ヽ惹草 submitted on 2021-02-08 05:29:23

Question


I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains lists introduced by bullet points, but the list items are not recognized as new sentences. I tried to add some parameters, but that didn't work. Is there another way?

Here is some example code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)

tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']

Answer 1:


You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars = BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")

This will result in the following output:

['•', 'I am a sentence •', 'I am another sentence']

However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:

I introduce a list of sentences.

  • I am sentence one
  • I am sentence two

And I am one, too!

Would, depending on the details of your text, result in something like the following:

>>> tokenizer.tokenize("""
Look at these sentences:

• I am sentence one
• I am sentence two

But I am one, too!
""")

['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
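If the stray bullet tokens in that output are acceptable as an intermediate result, a small post-processing step can clean them up. The following is a minimal sketch (the helper name `clean_sentences` is my own, not part of nltk); it drops items that consist only of a bullet and strips leading/trailing bullets and whitespace from the rest, while internal whitespace is left untouched:

```python
def clean_sentences(sentences):
    """Remove stray bullet markers from Punkt tokenizer output."""
    cleaned = []
    for s in sentences:
        # Strip surrounding whitespace, then any bullet chars, then whitespace again.
        s = s.strip().strip('•').strip()
        if s:  # skip items that were nothing but a bullet
            cleaned.append(s)
    return cleaned

raw = ['\nLook at these sentences:\n\n•',
       'I am sentence one\n•',
       'I am sentence two\n\nBut I am one, too!\n']
print(clean_sentences(raw))
# → ['Look at these sentences:', 'I am sentence one',
#    'I am sentence two\n\nBut I am one, too!']
```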

One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of simply employing something like a multi-delimiter split function, is that it can learn to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.

There should, however, be no such complications for •, so I would advise you to write a simple parser that preprocesses the bullet-point formatting instead of abusing PunktSentenceTokenizer for something it is not designed for. How this might be achieved in detail depends on how exactly this kind of markup is used in the text.
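Such a preprocessor could be as simple as a regex split on the bullet marker. The sketch below (the helper name `split_bullet_items` is my own invention) cuts the text into segments at each •; each resulting segment could then be fed to PunktSentenceTokenizer separately to split any remaining prose into sentences:

```python
import re

def split_bullet_items(text):
    """Split text at bullet markers, discarding empty segments."""
    segments = []
    for part in re.split(r'\s*•\s*', text):
        part = part.strip()
        if part:
            segments.append(part)
    return segments

print(split_bullet_items('• I am a sentence • I am another sentence'))
# → ['I am a sentence', 'I am another sentence']
```

Note that a segment may still contain more than one sentence (e.g. a bullet item followed by a trailing paragraph), which is exactly where running Punkt on each segment afterwards would pick up the remaining sentence boundaries.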



Source: https://stackoverflow.com/questions/29746635/nltk-sentence-tokenizer-custom-sentence-starters
