Java Lucene NGramTokenizer

梦毁少年i 2021-01-04 05:31

I am trying to tokenize strings into n-grams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual n-grams that were tokeni

4 answers
  •  别那么骄傲
    2021-01-04 06:20

    I don't think you'll find what you're looking for among methods that return a String. You'll need to work with Attributes instead.

    It should work something like this:

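    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    Reader reader = new StringReader("This is a test string");
    // Character n-grams of length 1 to 3
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
    // The attribute instance is updated in place on every incrementToken() call
    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
    gramTokenizer.reset();

    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        // Do something with the n-gram
    }
    gramTokenizer.end();
    gramTokenizer.close();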
    

    Be sure to reset() the Tokenizer again if it needs to be reused after that, though.
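
    To get a feel for which tokens to expect from sizes 1 to 3, here is a minimal plain-Java sketch that enumerates character n-grams without Lucene. The class and method names (CharNGrams, ngrams) are made up for illustration and are not part of Lucene, and the order in which Lucene emits its tokens may differ depending on the version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class CharNGrams {

    /** Returns every substring of text whose length lies in [minGram, maxGram]. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // Grow the window from minGram up to maxGram, stopping at the end of the text.
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [t, te, tes, e, es, est, s, st, t]
        System.out.println(ngrams("test", 1, 3));
    }
}
```

    Note that whitespace counts as an ordinary character here, so an input containing spaces yields n-grams that span word boundaries.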


    Tokenizing groups of words, rather than characters, as requested in the comments:

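    Reader reader = new StringReader("This is a test string");
    TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    // ShingleFilter requires a minimum shingle size of at least 2;
    // single words (unigrams) are still emitted by default.
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();

    while (tokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        // Do something with the shingle
    }
    tokenizer.end();
    tokenizer.close();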
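
    For comparison, the word-level grouping can be sketched in plain Java without Lucene. This is only an approximation under stated assumptions: it splits on whitespace rather than applying StandardTokenizer's rules, and the names WordShingles and shingles are hypothetical, not Lucene API. The emission order may also differ from what ShingleFilter produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class WordShingles {

    /** Returns every run of minWords..maxWords consecutive words, joined by single spaces. */
    static List<String> shingles(String text, int minWords, int maxWords) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // no words, no shingles
        }
        // Assumption: words are separated by whitespace only.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int n = minWords; n <= maxWords && start + n <= words.length; n++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [This, This is, This is a, is, is a, is a test, a, a test,
        //         a test string, test, test string, string] on a single line
        System.out.println(shingles("This is a test string", 1, 3));
    }
}
```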
    
