Stop words and stemmer in Java

广开言路 2021-02-06 16:02

I'm thinking of adding stop word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easier to implement).

I was wondering that si

3 answers
  •  栀梦 (original poster)
    2021-02-06 16:32

    If you're not implementing this for academic reasons, you should consider using the Lucene library. Either way, it may be a useful reference: it has classes for tokenization, stop word filtering, stemming, and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

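    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    // Place this method inside a class of your choice.
    public static String removeStopWordsAndStem(String input) throws IOException {
        // Stop list; with this StopFilter constructor matching is case-sensitive,
        // so "I" is listed in upper case.
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("I");
        stopWords.add("the");

        // Chain: tokenize -> drop stop words -> Porter-stem each remaining term.
        TokenStream tokenStream = new StandardTokenizer(
                Version.LUCENE_30, new StringReader(input));
        tokenStream = new StopFilter(true, tokenStream, stopWords);
        tokenStream = new PorterStemFilter(tokenStream);

        // Join the surviving terms with single spaces.
        StringBuilder sb = new StringBuilder();
        TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(termAttr.term());
        }
        tokenStream.close();
        return sb.toString();
    }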
    

    Which if used on your strings like this:

    public static void main(String[] args) throws IOException {
        String one = "I decided buy something from the shop.";
        String two = "Nevertheless I decidedly bought something from a shop.";
        System.out.println(removeStopWordsAndStem(one));
        System.out.println(removeStopWordsAndStem(two));
    }
    

    Yields this output:

    decid bui someth from shop
    Nevertheless decidedli bought someth from shop
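
If this is for an academic exercise and Lucene is off the table, the stop word step and the first Porter step are short enough to sketch in plain Java. The following is a minimal sketch, not the answer's method: the class name `StopWordSketch`, the three-word stop list, the lowercasing, and the regex tokenizer are all assumptions made for illustration. `porterStep1a` covers only step 1a (plural suffixes) of the original Porter algorithm, so its output will not match `PorterStemFilter`; the remaining steps would still have to be written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Illustrative stop list only (an assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "i", "the"));

    // Lowercases the input, splits on anything that is not a letter or digit,
    // and drops tokens found in the stop list.
    public static List<String> removeStopWords(String input) {
        List<String> kept = new ArrayList<>();
        String[] tokens = input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Step 1a of the original Porter algorithm only: plural suffixes.
    public static String porterStep1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                  // ss -> ss
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        // Filter the first sample sentence, then apply step 1a to each token.
        for (String token : removeStopWords("I decided buy something from the shop.")) {
            System.out.println(porterStep1a(token));
        }
    }
}
```

For the first sample sentence, `removeStopWords` returns `[decided, buy, something, from, shop]`. Unlike the Lucene example, this sketch lowercases before matching, so a single lowercase entry in the stop list covers both "I" and "i".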
    
