How can I split a text into sentences using the Stanford parser?

终归单人心 2020-11-27 14:52

How can I split a text or paragraph into sentences using the Stanford parser?

Is there a method that can extract sentences, such as getSentencesFromString()?

12 Answers
  • 2020-11-27 15:39

    Using the .NET (C#) package: this splits sentences, handles parentheses correctly, and preserves the original spaces and punctuation:

    public class NlpDemo
    {
        public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                    "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");
    
        public void ParseFile(string fileName)
        {
            using (var stream = File.OpenRead(fileName))
            {
                SplitSentences(stream);
            }
        }
    
        public void SplitSentences(Stream stream)
        {            
            var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
            preProcessor.setTokenizerFactory(TokenizerFactory);
    
            foreach (java.util.List sentence in preProcessor)
            {
                ProcessSentence(sentence);
            }            
        }
    
        // print the sentence with original spaces and punctuation.
        public void ProcessSentence(java.util.List sentence)
        {
            System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
        }
    }
    

    Input: - This sentence's characters possess a certain charm, one often found in punctuation and prose. This is a second sentence? It is indeed.

    Output: 3 sentences ('?' is considered an end-of-sentence delimiter)

    Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects.", the tokenizer correctly discerns that the period at the end of "Mrs." is not an end of sentence (EOS). However, it incorrectly marks the '!' inside the parentheses as an EOS and splits off "in all aspects." as a second sentence.

  • 2020-11-27 15:39

    A variation on @Kevin's answer that addresses the question is as follows:

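    // 'sentences' is assumed to be the List<CoreMap> from @Kevin's answer,
    // e.g. obtained with annotation.get(SentencesAnnotation.class).
    for (CoreMap sentence : sentences) {
        String sentenceText = sentence.get(TextAnnotation.class);
    }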
    

    which gets you the sentence information without bothering with the other annotators.

  • 2020-11-27 15:39

    Set the paths of the input and output files in the code below:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Properties;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class NLPExample
    {
        public static void main(String[] args) throws IOException
        {
            // Set these two paths for your machine.
            String inputPath = "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\input.txt";
            String outputPath = "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\output.txt";

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // try-with-resources closes (and flushes) both files, even if annotation fails.
            // A single writer is used so that two streams never compete for the output file.
            try (BufferedReader br = new BufferedReader(new FileReader(inputPath));
                 PrintWriter out = new PrintWriter(outputPath))
            {
                String line;
                while ((line = br.readLine()) != null)
                {
                    out.println(line);                      // echo the input line
                    Annotation annotation = new Annotation(line);
                    pipeline.annotate(annotation);          // tokenize, split, tag, lemmatize
                    pipeline.prettyPrint(annotation, out);  // write the sentences and their tokens
                }
            }
            System.out.println("Done...");
        }
    }
    
  • 2020-11-27 15:41

    You can do this fairly easily with the Stanford tagger.

    String text = "Your text....";  // Your own text.
    List<List<HasWord>> tokenizedSentences = MaxentTagger.tokenizeText(new StringReader(text));
    
    for (List<HasWord> sentence : tokenizedSentences)  // Iterate over the sentences
    {
        // Newer CoreNLP releases rename this class to SentenceUtils.
        System.out.println(edu.stanford.nlp.ling.Sentence.listToString(sentence));  // This is your sentence
    }
    
  • 2020-11-27 15:44

    Another element, not addressed except in a few downvoted answers, is how to set the sentence delimiters. The most common way, and the default, is to rely on the usual punctuation marks that end a sentence. Gathered corpora may come in other formats, though; one of them puts each sentence on its own line.

    To set your delimiters for the DocumentPreprocessor, as in the accepted answer, use setSentenceDelimiter(String). To use the pipeline approach suggested in the answer by @Kevin, work with the ssplit properties. For example, to use the end-of-line scheme described in the previous paragraph, set the property ssplit.eolonly to true.
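
    As a rough sketch of the pipeline configuration described here, the properties could be built as follows. The class and method names (EolSplitConfig, eolOnlyProperties) are hypothetical and only for illustration; the property keys "annotators" and "ssplit.eolonly" are the ones named in this thread. The snippet uses only java.util.Properties, so it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class EolSplitConfig {

    // Builds pipeline properties that treat every input line as one sentence.
    static Properties eolOnlyProperties() {
        Properties props = new Properties();
        // Tokenization and sentence splitting are enough to obtain sentences.
        props.setProperty("annotators", "tokenize, ssplit");
        // Split sentences at end of line only.
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = eolOnlyProperties();
        // With CoreNLP available, these would be passed to: new StanfordCoreNLP(props)
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

    The resulting Properties object would then be handed to the StanfordCoreNLP constructor, as in the other pipeline answers.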

  • 2020-11-27 15:48

    You can use a regular expression to split text into sentences. I use Regex in C#; I don't know the Java equivalent.

    Code:

    string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z\)][\.\!\?])\s+(?=[A-Z])");

    This works in about 90% of cases.
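
    For Java, a minimal sketch of the same idea with java.util.regex might look like the following. It is an adaptation, not the answerer's code: the class name RegexSentenceSplitter and its split method are hypothetical, and the pattern assumes a boundary is whitespace preceded by a letter, quote, or closing parenthesis plus one of . ! ? and followed by a capital letter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexSentenceSplitter {

    // Boundary: whitespace that follows [letter, quote or ')'] + [. ! ?]
    // and is followed by an uppercase ASCII letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=['\"A-Za-z)][.!?])\\s+(?=[A-Z])");

    // Splits the text at every boundary; the delimiters stay with their sentence.
    static List<String> split(String text) {
        return Arrays.asList(BOUNDARY.split(text.trim()));
    }

    public static void main(String[] args) {
        String text = "This is a sentence. This is a second sentence? It is indeed.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

    Unlike the Stanford tokenizer, this pattern knows nothing about abbreviations, so a text such as "Mrs. Havisham left." is split after "Mrs.".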
