Example using WikipediaTokenizer in Lucene

独厮守ぢ 2021-01-16 11:49

I want to use WikipediaTokenizer in a Lucene project: http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html

3 answers
  • 2021-01-16 12:41

    In Lucene 3.0, the next() method was removed. Use incrementToken() to iterate through the tokens instead; it returns false when you reach the end of the input stream. To read each token, use the methods of the AttributeSource class. Depending on which attributes you want (term, type, payload, etc.), register the corresponding attribute class with your tokenizer using the addAttribute method.

    The following partial code sample comes from the WikipediaTokenizer test class, which you can find in the Lucene source code download.

    ...
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
    int count = 0;
    int numItalics = 0;
    int numBoldItalics = 0;
    int numCategory = 0;
    int numCitation = 0;
    // Register the attributes to read before iterating.
    TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

    // incrementToken() returns false once the input is exhausted.
    while (tf.incrementToken()) {
      String tokText = termAtt.term();
      String tokType = typeAtt.type();
      //System.out.println("Text: " + tokText + " Type: " + tokType);
      // tcm comes from the elided part of the test; it is looked up by token text.
      String expectedType = (String) tcm.get(tokText);
      assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
      assertTrue(tokType + " is not equal to " + expectedType + " for " + tf.toString(), tokType.equals(expectedType));
      count++;
      if (tokType.equals(WikipediaTokenizer.ITALICS)) {
        numItalics++;
      } else if (tokType.equals(WikipediaTokenizer.BOLD_ITALICS)) {
        numBoldItalics++;
      } else if (tokType.equals(WikipediaTokenizer.CATEGORY)) {
        numCategory++;
      } else if (tokType.equals(WikipediaTokenizer.CITATION)) {
        numCitation++;
      }
    }
    ...
    
  • 2021-01-16 12:43

    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
    Token token = new Token();
    token = tf.next(token);

    http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

    Regards

  • 2021-01-16 12:50

    import java.io.StringReader;

    import org.apache.log4j.Logger; // assumes the Logger used here is log4j 1.x
    import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

    public class WikipediaTokenizerTest {
        static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);
        protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";

        // Builds a tokenizer over a small piece of wiki markup.
        public WikipediaTokenizer testSimple() throws Exception {
            String text = "This is a [[Category:foo]]";
            return new WikipediaTokenizer(new StringReader(text));
        }

        public static void main(String[] args) {
            WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();

            try {
                WikipediaTokenizer x = wtt.testSimple();

                logger.info(x.hasAttributes());

                // Advance one token at a time until the input is exhausted.
                while (x.incrementToken()) {
                    logger.info("seen something");
                }
            } catch (Exception e) {
                logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
            }
        }
    }
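
The incrementToken()/addAttribute pattern described in the first answer can be tried without Lucene on the classpath. The following is only a dependency-free sketch of that consume loop, not Lucene's implementation: `IncrementTokenSketch`, `SketchTokenizer`, `TermAttr`, `TypeAttr`, and the `<NUM>`/`<ALPHANUM>` type labels are hypothetical stand-ins invented for illustration, and the tokenizer simply splits on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementTokenSketch {
    /** Hypothetical stand-in for a term attribute: holds the current token text. */
    static final class TermAttr { String term; }

    /** Hypothetical stand-in for a type attribute: holds the current token type. */
    static final class TypeAttr { String type; }

    /** Hypothetical whitespace tokenizer that follows the incrementToken() contract. */
    static final class SketchTokenizer {
        private final String[] words;
        private int pos = 0;
        // The attribute objects are reused; each call overwrites their state.
        final TermAttr termAtt = new TermAttr();
        final TypeAttr typeAtt = new TypeAttr();

        SketchTokenizer(String input) {
            String trimmed = input.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Advances to the next token; returns false when the input is exhausted. */
        boolean incrementToken() {
            if (pos >= words.length) {
                return false;
            }
            String w = words[pos++];
            termAtt.term = w;
            typeAtt.type = w.chars().allMatch(Character::isDigit) ? "<NUM>" : "<ALPHANUM>";
            return true;
        }
    }

    /** Runs the consume loop and returns each token as "text/type". */
    public static List<String> collect(String input) {
        SketchTokenizer tok = new SketchTokenizer(input);
        List<String> out = new ArrayList<>();
        while (tok.incrementToken()) {
            // Read the current token through the attribute objects.
            out.add(tok.termAtt.term + "/" + tok.typeAtt.type);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lucene/<ALPHANUM>, 3/<NUM>, tokens/<ALPHANUM>]
        System.out.println(collect("lucene 3 tokens"));
    }
}
```

The point of the sketch is that the loop never receives a token object: the attribute references are obtained once, and their contents change on every successful incrementToken() call, so values that must outlive the loop iteration need to be copied out.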
    