Question
I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains lists whose items start with bullet points, but these are not recognized as new sentences. I tried to add some parameters, but that didn't work. Is there another way?
Here is some example code:
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> params = PunktParameters()
>>> params.sent_starters = set(['•'])
>>> tokenizer = PunktSentenceTokenizer(params)
>>> tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']
Answer 1:
You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    # Treat the bullet as an additional sentence-ending character.
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars=BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")
This will result in the following output:
['•', 'I am a sentence •', 'I am another sentence']
However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:
I introduce a list of sentences.
• I am sentence one
• I am sentence two
And I am one, too!
Would, depending on the details of your text, result in something like the following:
>>> tokenizer.tokenize("""
Look at these sentences:

• I am sentence one
• I am sentence two

But I am one, too!
""")
['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
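If you stay with this approach, the stray bullet tokens can be stripped from the output afterwards. A minimal sketch based on the first output shown above (the choice of characters to strip is my own):

>>> raw = tokenizer.tokenize(u"• I am a sentence • I am another sentence")
>>> [s.strip(u'• \n') for s in raw if s.strip(u'• \n')]
['I am a sentence', 'I am another sentence']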
One reason why PunktSentenceTokenizer is used for sentence tokenization instead of simply employing something like a multi-delimiter split function is that it can learn to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.
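For illustration, a quick check with nltk's pre-trained English Punkt model (this assumes the "punkt" data has already been downloaded, e.g. via nltk.download('punkt')):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Mr. Smith arrived. He sat down.")
['Mr. Smith arrived.', 'He sat down.']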
There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for.
How this might be achieved in detail depends on how exactly this kind of markup is used in the text.
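One possible sketch, assuming bullets only ever appear as item markers: split the text on the bullet character first, then let Punkt tokenize each fragment on its own. The split_bullets helper and its regex are my own illustration, not part of nltk:

import re
from nltk.tokenize.punkt import PunktSentenceTokenizer

def split_bullets(text):
    """Split on bullet markers first, then let Punkt tokenize each fragment."""
    tokenizer = PunktSentenceTokenizer()
    sentences = []
    # The text before the first bullet and each bullet item become separate units.
    for fragment in re.split(r'\s*•\s*', text):
        if fragment.strip():
            sentences.extend(tokenizer.tokenize(fragment.strip()))
    return sentences

>>> split_bullets(u"Look at these sentences: • I am sentence one • I am sentence two")
['Look at these sentences:', 'I am sentence one', 'I am sentence two']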
Source: https://stackoverflow.com/questions/29746635/nltk-sentence-tokenizer-custom-sentence-starters