Question
I'm trying to create a summarizer in Java. I use the Stanford Log-linear Part-Of-Speech Tagger to tag the words, score each sentence based on certain tags, and then print the high-scoring sentences in the summary. Here's the code:
MaxentTagger tagger = new MaxentTagger("taggers/bidirectional-distsim-wsj-0-18.tagger");
BufferedReader reader = new BufferedReader( new FileReader ("C:\\Summarizer\\src\\summarizer\\testing\\testingtext.txt"));
String line = null;
int score = 0;
StringBuilder stringBuilder = new StringBuilder();
File tempFile = new File("C:\\Summarizer\\src\\summarizer\\testing\\tempFile.txt");
Writer writerForTempFile = new BufferedWriter(new FileWriter(tempFile));
String ls = System.getProperty("line.separator");
while( ( line = reader.readLine() ) != null )
{
stringBuilder.append( line );
stringBuilder.append( ls );
String tagged = tagger.tagString(line);
Pattern pattern = Pattern.compile("[.?!]"); // find sentence-ending punctuation
Matcher matcher = pattern.matcher(tagged);
while(matcher.find())
{
Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
Matcher tagMatcher = tagFinder.matcher(matcher.group());
while(tagMatcher.find())
{
score++; // increase score of sentence for every occurrence of adjective tag
}
if(score > 1)
writerForTempFile.write(stringBuilder.toString());
score = 0;
stringBuilder.setLength(0);
}
}
reader.close();
writerForTempFile.close();
The above code isn't working. However, if I simplify the task and compute a score for every line (not every sentence), it works. But summaries aren't generated that way, are they? Here's the code for that (all declarations are the same as above):
while( ( line = reader.readLine() ) != null )
{
stringBuilder.append( line );
stringBuilder.append( ls );
String tagged = tagger.tagString(line);
Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
Matcher tagMatcher = tagFinder.matcher(tagged);
while(tagMatcher.find())
{
score++; // increase score of line for every occurrence of adjective tag
}
if(score > 1)
writerForTempFile.write(stringBuilder.toString());
score = 0;
stringBuilder.setLength(0);
}
EDIT 1:
Information on what the MaxentTagger does. Sample code showing how it works:
import java.io.IOException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
public class TagText {
public static void main(String[] args) throws IOException,
ClassNotFoundException {
// Initialize the tagger
MaxentTagger tagger = new MaxentTagger(
"taggers/bidirectional-distsim-wsj-0-18.tagger");
// The sample string
String sample = "This is a sample sentence";
// The tagged string
String tagged = tagger.tagString(sample);
// Output the result
System.out.println(tagged);
}
}
Output:
This/DT is/VBZ a/DT sample/NN sentence/NN
EDIT 2:
Modified code using BreakIterator to find sentence breaks. The problem still persists.
while( ( line = reader.readLine() ) != null )
{
stringBuilder.append( line );
stringBuilder.append( ls );
String tagged = tagger.tagString(line);
BreakIterator bi = BreakIterator.getSentenceInstance();
bi.setText(tagged);
int end, start = bi.first();
while ((end = bi.next()) != BreakIterator.DONE)
{
String sentence = tagged.substring(start, end);
Pattern tagFinder = Pattern.compile("/JJ");
Matcher tagMatcher = tagFinder.matcher(sentence);
while(tagMatcher.find())
{
score++;
}
scoreTracker.add(score);
if(score > 1)
writerForTempFile.write(stringBuilder.toString());
score = 0;
stringBuilder.setLength(0);
start = end;
}
}
Answer 1:
Finding sentence breaks can be more involved than just looking for [.?!]; consider using BreakIterator.getSentenceInstance().
In my own testing, at least, its performance was quite similar to LingPipe's (more complex) implementation and better than the one in OpenNLP.
Sample Code
BreakIterator bi = BreakIterator.getSentenceInstance();
bi.setText(text);
int end, start = bi.first();
while ((end = bi.next()) != BreakIterator.DONE) {
String sentence = text.substring(start, end);
start = end;
}
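The loop above can be wrapped in a small helper that returns the sentences as a list. This is only a minimal sketch: the class name SentenceSplitDemo, the method name splitSentences, the explicit Locale.US, and the trimming of trailing whitespace are my own assumptions, not part of the original code. It is meant to run on the raw text, before any tagging.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name is not from the original post.
class SentenceSplitDemo {

    // Splits plain (untagged) text into trimmed, non-empty sentences.
    public static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        int end;
        while ((end = bi.next()) != BreakIterator.DONE) {
            // Each span includes the whitespace that follows the sentence, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : splitSentences("This is one. This is two! Is this three?")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Returning a list keeps the splitting step separate from tagging and scoring, so each part can be checked on its own.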
Edit
I think this is what you're looking for:
Pattern tagFinder = Pattern.compile("/JJ");
BufferedReader reader = getMyReader();
String line = null;
while ((line = reader.readLine()) != null) {
BreakIterator bi = BreakIterator.getSentenceInstance();
bi.setText(line);
int end, start = bi.first();
while ((end = bi.next()) != BreakIterator.DONE) {
String sentence = line.substring(start, end);
String tagged = tagger.tagString(sentence);
int score = 0;
Matcher tag = tagFinder.matcher(tagged);
while (tag.find())
score++;
if (score > 1)
writerForTempFile.write(sentence + ls); // Writer has no println(); ls is the line separator declared in the question
start = end;
}
}
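To try the split, tag, score sequence without loading a tagger model, the tagger can be passed in as a function. The following is a rough sketch under several assumptions: SentenceScorer, selectSentences, countAdjectives, and toyTagger are hypothetical names, and toyTagger is a stand-in that only imitates the word/TAG output format (it is not the Stanford tagger). With the real tagger, one would presumably pass tagger::tagString instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; all names here are illustrative only.
class SentenceScorer {
    // Same pattern as in the post; note it also matches /JJR and /JJS.
    private static final Pattern ADJECTIVE_TAG = Pattern.compile("/JJ");

    // Counts adjective tags in an already tagged string.
    public static int countAdjectives(String tagged) {
        Matcher m = ADJECTIVE_TAG.matcher(tagged);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Splits raw text into sentences, tags each one, keeps those scoring >= minScore.
    public static List<String> selectSentences(String text,
                                               Function<String, String> tagger,
                                               int minScore) {
        List<String> kept = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        int end;
        while ((end = bi.next()) != BreakIterator.DONE) {
            String sentence = text.substring(start, end).trim();
            start = end;
            if (sentence.isEmpty()) {
                continue;
            }
            int score = countAdjectives(tagger.apply(sentence)); // fresh score per sentence
            if (score >= minScore) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    // Stand-in tagger (an assumption): marks three known words as /JJ, the rest as /XX.
    public static String toyTagger(String sentence) {
        Set<String> adjectives = new HashSet<>(Arrays.asList("quick", "brown", "lazy"));
        StringBuilder out = new StringBuilder();
        for (String word : sentence.split("\\s+")) {
            String bare = word.replaceAll("\\W", "").toLowerCase(Locale.ROOT);
            out.append(word).append(adjectives.contains(bare) ? "/JJ " : "/XX ");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox ran. It rained. A lazy dog slept.";
        // minScore 2 corresponds to the "score > 1" test in the post.
        System.out.println(selectSentences(text, SentenceScorer::toyTagger, 2));
    }
}
```

Computing the score in a local variable per sentence removes the need to reset a shared counter, which is easy to forget in the nested loops.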
Answer 2:
Without fully understanding it all, my guess is that your code should look more like this:
int lastMatch = 0;// Added
Pattern pattern = Pattern.compile("[.?!]"); // find sentence-ending punctuation
Matcher matcher = pattern.matcher(tagged);
while(matcher.find())
{
Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
// HERE START OF MY CHANGE
String sentence = tagged.substring(lastMatch, matcher.end());
lastMatch = matcher.end();
Matcher tagMatcher = tagFinder.matcher(sentence);
// HERE END OF MY CHANGE
while(tagMatcher.find())
{
score++; // increase score of sentence for every occurrence of adjective tag
}
if(score > 1)
writerForTempFile.write(sentence);
score = 0;
}
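The reason for the change may be easier to see side by side: matcher.group() returns only the text matched by [.?!], that is, a single punctuation character, so searching it for /JJ can never succeed. The sketch below compares the two on a hand-written tagged string. The string is an assumption (it supposes the tagger writes a sentence-final period as "./.", which the post does not show), and GroupVsSubstring and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class GroupVsSubstring {
    private static final Pattern END = Pattern.compile("[.?!]");

    // What the question's loop hands to the tag matcher: just the matched character.
    public static List<String> matchedGroups(String tagged) {
        List<String> groups = new ArrayList<>();
        Matcher m = END.matcher(tagged);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    // What the changed loop hands over: everything since the previous match.
    public static List<String> spansUpToMatch(String tagged) {
        List<String> spans = new ArrayList<>();
        Matcher m = END.matcher(tagged);
        int lastMatch = 0;
        while (m.find()) {
            spans.add(tagged.substring(lastMatch, m.end()));
            lastMatch = m.end();
        }
        return spans;
    }

    public static void main(String[] args) {
        String tagged = "A/DT nice/JJ day/NN ./."; // hand-tagged, assumed format
        System.out.println(matchedGroups(tagged));  // prints [., .]
        System.out.println(spansUpToMatch(tagged)); // prints [A/DT nice/JJ day/NN ., /.]
    }
}
```

Under that assumed format, the period's own tag is matched as a second "sentence" ("/."). It scores zero and is harmless here, but it suggests that splitting the raw text before tagging is the more robust order.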
Source: https://stackoverflow.com/questions/9702739/score-each-sentence-in-a-line-based-upon-a-tag-and-summarize-the-text-java