Java Lucene NGramTokenizer

梦毁少年i 2021-01-04 05:31

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokeni

4 Answers
  • 2021-01-04 05:55

    Without creating a test program, I would guess that incrementToken() advances the stream to the next token, which will be one of the ngrams.

    For example, using ngram lengths of 1-3 with the string 'a b c d', NGramTokenizer could return:

    a
    a b
    a b c
    b
    b c
    b c d
    c
    c d
    d
    

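    where 'a', 'a b', etc. are the resulting ngrams. (These are word-level groups. NGramTokenizer itself works on characters, so its real output for this input would differ; for example, it would also emit 'a ' and ' b'.)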
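
    To make the character-versus-word distinction concrete, here is a plain-Java sketch with no Lucene dependency that enumerates character-level ngrams, the level at which NGramTokenizer operates. The class name CharNGrams and the method charNGrams are hypothetical, and the emission order (by start offset, then by length) is an assumption for illustration; the order Lucene uses depends on the version.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {

    // Hypothetical helper: returns every substring of text whose length is
    // between minGram and maxGram (inclusive), ordered by start offset and
    // then by length. Spaces are treated like any other character.
    public static List<String> charNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Brackets make the whitespace inside each ngram visible.
        for (String gram : charNGrams("a b", 1, 3)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

    For the input "a b" this prints six tokens: [a], [a ], [a b], [ ], [ b], [b]. Tokens such as [a ] and [ ] never appear in a word-level listing, which is why grouping whole words needs a different component.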

    [Edit]

    You might also want to look at "Querying lucene tokens without indexing", which discusses peeking into the token stream.

  • 2021-01-04 06:05
    package ngramalgoimpl;
    import java.util.*;
    
    public class ngr {
    
        // Returns every run of n consecutive space-separated words in str.
        public static List<String> n_grams(int n, String str) {
            List<String> n_grams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                n_grams.add(concatenation(words, i, i + n));
            return n_grams;
        }

        // Joins words[start..end) with single spaces; a StringBuilder avoids
        // allocating a new String on every append.
        public static String concatenation(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append(i > start ? " " : "").append(words[i]);
            return sb.toString();
        }
    
        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : n_grams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }
    
  • 2021-01-04 06:18

    For a recent version of Lucene (4.2.1), the following code works. Before running it, add two JAR files to the classpath:

    • lucene-core-4.2.1.jar
    • lucene-analyzers-common-4.2.1.jar

    Find these files at http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1

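    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Lucene 4.2.1
    Reader reader = new StringReader("This is a test string");
    // Emit character ngrams of length 1 to 3.
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);

    gramTokenizer.reset(); // the TokenStream contract expects reset() before the first incrementToken()
    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        System.out.println(token);
    }
    gramTokenizer.end();
    gramTokenizer.close();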
    
  • 2021-01-04 06:20

    I don't think you'll find what you're looking for by searching for methods that return String. You'll need to work with Attributes.

    Should work something like:

    Reader reader = new StringReader("This is a test string");
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
    gramTokenizer.reset();
    
    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        //Do something
    }
    gramTokenizer.end();
    gramTokenizer.close();
    

    Be sure to reset() the Tokenizer if it needs to be reused after that, though.


    To tokenize groups of words rather than characters, per the comments:

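    Reader reader = new StringReader("This is a test string");
    TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    // ShingleFilter requires a minimum shingle size of at least 2; single words
    // (unigrams) are still emitted because outputUnigrams defaults to true.
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();

    while (tokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        //Do something
    }
    tokenizer.end();
    tokenizer.close();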
    