Example using WikipediaTokenizer in Lucene

Asked by 独厮守ぢ on 2021-01-16 11:49

I want to use the WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html - but I am not sure how to use it to iterate over the tokens.

3 Answers
  • 2021-01-16 12:41

    In Lucene 3.0, the next() method has been removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To obtain each token's data, use the methods of the AttributeSource class. Depending on which attributes you want to read (term, type, payload, etc.), you need to add the class of the corresponding attribute to your tokenizer with the addAttribute() method.

    The following partial code sample is from the test class for WikipediaTokenizer, which you can find in the Lucene source distribution; a simpler standalone sketch follows after it.

    ...
    // "test" is the wiki markup to tokenize; "tcm" is a map built earlier in the
    // test that maps each expected token text to its expected token type.
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
    int count = 0;
    int numItalics = 0;
    int numBoldItalics = 0;
    int numCategory = 0;
    int numCitation = 0;
    // Register the attributes you want to read before iterating.
    TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
    
    while (tf.incrementToken()) {
      String tokText = termAtt.term();
      //System.out.println("Text: " + tokText + " Type: " + typeAtt.type());
      String expectedType = (String) tcm.get(tokText);
      assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
      assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType));
      count++;
      if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)) {
        numItalics++;
      } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)) {
        numBoldItalics++;
      } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)) {
        numCategory++;
      } else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)) {
        numCitation++;
      }
    }
    ...
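
    If you only want to see what the tokenizer produces, without the test's tcm map and assertions, a minimal self-contained sketch looks like the following. It assumes Lucene 3.0.x with the contrib-wikipedia jar on the classpath; the class name and input string are made up for illustration:

    import java.io.StringReader;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

    public class WikipediaTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // Sample wiki markup, purely for illustration.
            String text = "Some [[internal link]] text, [http://lucene.apache.org an external link] and [[Category:demo]]";
            WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(text));
            // Register the attributes you want to read before iterating.
            TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);
            TypeAttribute typeAtt = tokenizer.addAttribute(TypeAttribute.class);
            while (tokenizer.incrementToken()) {
                // Print each token's text and its wiki-specific type.
                System.out.println(termAtt.term() + " -> " + typeAtt.type());
            }
            tokenizer.close();
        }
    }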
    
  • 2021-01-16 12:43

    // Note: this is the pre-3.0 TokenStream API; next(Token) was removed in Lucene 3.0,
    // so this only works with older Lucene versions.
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
    Token token = new Token();
    token = tf.next(token);

    http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

    Regards

  • 2021-01-16 12:50

    import java.io.StringReader;
    // Assuming log4j's Logger; the original snippet does not show its imports.
    import org.apache.log4j.Logger;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

    public class WikipediaTokenizerTest {
        static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);
        protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";

        public WikipediaTokenizer testSimple() throws Exception {
            String text = "This is a [[Category:foo]]";
            return new WikipediaTokenizer(new StringReader(text));
        }

        public static void main(String[] args) {
            WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();
            try {
                WikipediaTokenizer x = wtt.testSimple();
                logger.info(x.hasAttributes());

                // Unused here; see the first answer for how to count token types.
                Token token = new Token();
                int count = 0;
                int numItalics = 0;
                int numBoldItalics = 0;
                int numCategory = 0;
                int numCitation = 0;

                while (x.incrementToken()) {
                    logger.info("seen something");
                }
            } catch (Exception e) {
                logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
            }
        }
    }
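
    As written, the main loop only logs that a token was seen. To print the actual token text and type, register the attributes before iterating (as in the first answer; TermAttribute and TypeAttribute live in org.apache.lucene.analysis.tokenattributes) and replace the while loop with something along these lines, a sketch against the same Lucene 3.0.x contrib API:

    TermAttribute termAtt = x.addAttribute(TermAttribute.class);
    TypeAttribute typeAtt = x.addAttribute(TypeAttribute.class);
    while (x.incrementToken()) {
        // Logs each token's text and type, e.g. the WikipediaTokenizer.CATEGORY token from [[Category:foo]].
        logger.info(termAtt.term() + " -> " + typeAtt.type());
    }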
    