How can I split a text into sentences using the Stanford parser?


How can I split a text or paragraph into sentences using the Stanford parser?

Is there any method that can extract sentences, such as getSentencesFromString()?

12 answers
  • 2020-11-27 15:23
    public class SplitExample {

        public static void main(String[] args) {
            // Split on a single space character
            String str = "This program splits a string based on space";
            String[] words = str.split(" ");
            for (String s : words) {
                System.out.println(s);
            }
            // Split on any run of whitespace, which also handles repeated spaces
            str = "This     program  splits a string based on space";
            words = str.split("\\s+");
            for (String s : words) {
                System.out.println(s);
            }
        }
    }
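
    The question asks about sentences rather than words, though. A naive variant of the same split() idea targets sentence-ending punctuation; here is a rough sketch, with the caveat that a plain regex split mishandles abbreviations like "Dr." (which is exactly what the Stanford tools in the other answers handle properly):

    public class NaiveSentenceSplit {
        public static void main(String[] args) {
            String text = "My 1st sentence. Does it work for questions? My third sentence.";
            // split after ., !, or ? when followed by whitespace; breaks on "Dr." etc.
            String[] sentences = text.split("(?<=[.!?])\\s+");
            for (String s : sentences) {
                System.out.println(s);
            }
        }
    }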
    
  • 2020-11-27 15:27

    You can check the DocumentPreprocessor class. Below is a short snippet. I think there may be other ways to do what you want.

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.SentenceUtils;
    import edu.stanford.nlp.process.DocumentPreprocessor;

    String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
    Reader reader = new StringReader(paragraph);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();

    for (List<HasWord> sentence : dp) {
       // use SentenceUtils (formerly Sentence) to rejoin the tokens into a string
       String sentenceString = SentenceUtils.listToString(sentence);
       sentenceList.add(sentenceString);
    }

    for (String sentence : sentenceList) {
       System.out.println(sentence);
    }
    
  • 2020-11-27 15:29

    You can use the document preprocessor. It's really easy. Just feed it a filename.

        for (List<HasWord> sentence : new DocumentPreprocessor("pathto/filename.txt")) {
             // sentence is a list of the words in one sentence
             System.out.println(SentenceUtils.listToString(sentence));
        }
    
  • 2020-11-27 15:30

    There are a couple of issues with the accepted answer. First, the tokenizer transforms some characters; for example, it turns the opening quote “ into the two characters ``. Second, rejoining the tokenized text with whitespace does not reproduce the original text. As a result, the example in the accepted answer transforms the input text in non-trivial ways.

    However, the CoreLabel objects that the tokenizer produces keep track of the character offsets in the source text they map to, so it is trivial to rebuild the proper string if you have the original.

    Approach 1 below shows the accepted answer's approach; Approach 2 shows my approach, which overcomes these issues.

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.DocumentPreprocessor;
    import edu.stanford.nlp.process.PTBTokenizer;
    import edu.stanford.nlp.process.WordToSentenceProcessor;
    import edu.stanford.nlp.util.StringUtils;

    String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

    List<String> sentenceList;

    /* ** APPROACH 1 (BAD!) ** */
    Reader reader = new StringReader(paragraph);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    sentenceList = new ArrayList<String>();
    for (List<HasWord> sentence : dp) {
        sentenceList.add(Sentence.listToString(sentence));
    }
    System.out.println(StringUtils.join(sentenceList, " _ "));

    /* ** APPROACH 2 ** */
    //// Tokenize
    List<CoreLabel> tokens = new ArrayList<CoreLabel>();
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
    while (tokenizer.hasNext()) {
        tokens.add(tokenizer.next());
    }
    //// Split sentences from tokens
    List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);
    //// Rebuild each sentence from the original string via token character offsets
    int end;
    int start = 0;
    sentenceList = new ArrayList<String>();
    for (List<CoreLabel> sentence : sentences) {
        end = sentence.get(sentence.size() - 1).endPosition();
        sentenceList.add(paragraph.substring(start, end).trim());
        start = end;
    }
    System.out.println(StringUtils.join(sentenceList, " _ "));
    

    This outputs:

    My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
    My 1st sentence. _ “Does it work for questions?” _ My third sentence.
    
  • 2020-11-27 15:34

    You can use the Simple API provided by Stanford CoreNLP version 3.6.0 or 3.7.0.

    Here's an example with 3.6.0. It works exactly the same with 3.7.0.

    Java Code Snippet

    import java.util.List;
    
    import edu.stanford.nlp.simple.Document;
    import edu.stanford.nlp.simple.Sentence;
    public class TestSplitSentences {
        public static void main(String[] args) {
            Document doc = new Document("The text paragraph. Another sentence. Yet another sentence.");
            List<Sentence> sentences = doc.sentences();
            sentences.stream().forEach(System.out::println);
        }
    }
    

    Yields:

    The text paragraph.

    Another sentence.

    Yet another sentence.
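
    If you want plain strings rather than Sentence objects, the simple API's text() accessor can be mapped over the list; a minimal variation, assuming the same doc as above:

    // collect the raw text of each sentence into a List<String>
    List<String> sentenceStrings = doc.sentences().stream()
            .map(Sentence::text)
            .collect(java.util.stream.Collectors.toList());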

    pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>stanfordcorenlp</groupId>
        <artifactId>stanfordcorenlp</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
        </properties>
    
        <dependencies>
            <!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
            <dependency>
                <groupId>edu.stanford.nlp</groupId>
                <artifactId>stanford-corenlp</artifactId>
                <version>3.6.0</version>
            </dependency>
            <!-- https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java -->
            <dependency>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
                <version>2.6.1</version>
            </dependency>
        </dependencies>
    </project>
    
  • 2020-11-27 15:36

    I know there is already an accepted answer... but typically you'd just grab the SentencesAnnotation from an annotated document.

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String text = ... // Add your text here!

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
      // traversing the words in the current sentence
      // a CoreLabel is a CoreMap with additional token-specific methods
      for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
      }
    }
    

    Source: http://nlp.stanford.edu/software/corenlp.shtml (halfway down the page)

    And if you're only looking for sentences, you can drop the later steps like "parse" and "dcoref" from the pipeline initialization; it'll save you some load and processing time (see the sketch below). Rock and roll. ~K
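
    For reference, a minimal sketch of such a sentences-only pipeline; "tokenize" and "ssplit" are the actual annotator names, while the class and variable names here are just illustrative:

    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class SentenceSplitOnly {
        public static void main(String[] args) {
            // only the tokenizer and sentence splitter are loaded: no tagger,
            // parser, or coreference models, so startup and annotation are much faster
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("My 1st sentence. Another one. A third.");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
                // the TextAnnotation of a sentence CoreMap holds its text
                System.out.println(sentence.get(TextAnnotation.class));
            }
        }
    }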
