How to identify the end of a sentence

跟風遠走 submitted on 2019-12-11 10:19:53

Question


String x=" i am going to the party at 6.00 in the evening. are you coming with me?";

Given the string above, I need to break it into sentences at sentence-boundary punctuation (such as . and ?).

But it should not split at 6.00, because the period there is a decimal point. Is there a way to identify the correct sentence boundaries in Java? I tried StringTokenizer from the java.util package, but it breaks the sentence at every period. Can someone suggest a way to do this correctly?

This is the method I tried for tokenizing a text into sentences.

public static ArrayList<String> sentence_segmenter(String text) {
    ArrayList<String> sentences = new ArrayList<String>();

    // Every '.', '?' and '!' is treated as a delimiter, including the one in 6.00
    StringTokenizer st = new StringTokenizer(text, ".?!");
    while (st.hasMoreTokens()) {
        sentences.add(st.nextToken());
    }
    return sentences;
}

I also have a method to segment sentences into phrases, but it too splits the text whenever it finds a comma (,). I do not want it to split inside a number such as 60,000. This is the method I am using to segment phrases.

public static ArrayList<String> phrasesSegmenter(String text) {
    ArrayList<String> phrases = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(text, ",");
    while (st.hasMoreTokens()) {
        phrases.add(st.nextToken());
    }
    return phrases;
}

Answer 1:


From the documentation of StringTokenizer:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

If you use split, you can split the text into sentences with any regular expression. You probably want something like one of ?, ! or . followed by either whitespace or the end of the text:

text.split("[?!.]($|\\s)")
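To make the effect of that expression concrete, here is a minimal sketch that applies it to the string from the question, and applies the same idea to commas for the phrase splitter. The class name `RegexSegmenter`, the two method names, the `trim()` call, and the comma pattern are assumptions added for illustration; only the sentence pattern comes from the answer above.

```java
import java.util.Arrays;
import java.util.List;

public class RegexSegmenter {

    // Sentence boundary: one of ? ! . followed by whitespace or the end of the text.
    // split consumes the matched punctuation, so it does not appear in the result.
    public static List<String> sentences(String text) {
        return Arrays.asList(text.trim().split("[?!.]($|\\s)"));
    }

    // Same idea for phrases (an assumption, not part of the answer): a comma only
    // counts when whitespace or the end of the text follows, so 60,000 stays whole.
    public static List<String> phrases(String text) {
        return Arrays.asList(text.trim().split(",($|\\s)"));
    }

    public static void main(String[] args) {
        String x = " i am going to the party at 6.00 in the evening. are you coming with me?";
        for (String s : sentences(x)) {
            System.out.println("[" + s + "]");
        }
        for (String p : phrases("it cost 60,000 dollars, which is a lot")) {
            System.out.println("[" + p + "]");
        }
    }
}
```

The period in 6.00 is followed by a digit rather than whitespace, so it is not treated as a boundary. Abbreviations such as "Dr. Smith" would still be split by this pattern, since the period there is followed by a space.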



Answer 2:


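Here is my solution to the problem.

/**
 * Tries to decide whether there is a sentence end at index i of the given text.
 *
 * @param text the text to examine
 * @param i index of the character to test
 * @return true if the character at index i ends a sentence
 */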
public static boolean isSentenceEnd(String text, int i) {
    char c = text.charAt(i);
    return isSentenceEndChar(c) && !isPeriodWord(text, i);
} 
private static final String[] periodWords = { "Mr.", "Mrs.", "Ms.", "Prof.", "Dr.", "Gen.", "Rep.", "Sen.", "St.",
                "Sr.", "Jr.", "Ph.", "Ph.D.", "M.D.", "B.A.", "M.A.", "D.D.", "D.D.S.",
                "B.C.", "b.c.", "a.m.", "A.M.", "p.m.", "P.M.", "A.D.", "a.d.", "B.C.E.", "C.E.",
                "i.e.", "etc.", "e.g.", "al."};

/**
 * PeriodWords are words such as 'Dr.' or 'Mr.'
 *
 * @param text the text to examine
 * @param i index of the period '.' character
 * @return true if the period belongs to a period word or a number rather than ending a sentence
 */
private static boolean isPeriodWord(String text, int i) {
    if (text.charAt(i) != '.') return false; // only a period can be part of a period word
    if (i < 4) return true; // a period within the first few characters is not treated as a sentence end
    if (text.charAt(i - 2) == ' ') return true; // one-char words are definitely period words
    String txt = text.substring(0, i + 1); // include the period itself so endsWith can match "Mr."
    for (String pword : periodWords) {
        if (txt.endsWith(pword)) return true;
    }
    // dates separated with "." or numbers with a fraction: a digit on both sides of the period
    return Character.isDigit(text.charAt(i - 1))
            && i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
}

private static final char[] sentenceEndChars = {'.', '?', '−'};
private static boolean isSentenceEndChar(char c) {
    for (char sec : sentenceEndChars) {
        if (c == sec) return true;
    }
    return false;
}
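
The answer only decides whether a single index is a sentence end; it does not show how to turn that into a list of sentences. One possible driver is sketched below. The class `SentenceSplitter`, its `split` method, and the stand-in predicate `simpleEnd` are all hypothetical names added here so the example runs on its own; in practice you would pass a method reference to the `isSentenceEnd` method above instead of `simpleEnd`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class SentenceSplitter {

    // Cuts the text after every index that the predicate accepts as a sentence end.
    public static List<String> split(String text, BiPredicate<String, Integer> isEnd) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isEnd.test(text, i)) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) sentences.add(s);
                start = i + 1;
            }
        }
        // Keep any text after the last detected boundary.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) sentences.add(tail);
        return sentences;
    }

    // Hypothetical stand-in predicate: '.' or '?' that is not followed by a digit.
    public static boolean simpleEnd(String text, int i) {
        char c = text.charAt(i);
        boolean digitNext = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
        return (c == '.' || c == '?') && !digitNext;
    }

    public static void main(String[] args) {
        String x = " i am going to the party at 6.00 in the evening. are you coming with me?";
        split(x, SentenceSplitter::simpleEnd).forEach(System.out::println);
    }
}
```

Unlike a split on a regular expression, this keeps the closing punctuation attached to each sentence, because the cut is made after the boundary character rather than on it.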


Source: https://stackoverflow.com/questions/26704900/how-to-identify-a-end-of-a-sentence
