How to classify documents indexed with Lucene

醉梦人生 2021-02-10 02:40

I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any w

3 answers
  •  伪装坚强ぢ
    2021-02-10 03:11

    Classification is a broad problem in machine learning and statistics. From your question, it sounds like what you have done so far is closer to an SQL GROUP BY clause (expressed in Lucene) than to actual classification. If you want the machine to classify the documents, then you need machine learning algorithms such as neural networks, Bayesian classifiers, or SVMs. There are excellent Java libraries for these tasks. For this to work you need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict the classification label.
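
    To make the "features plus training" idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Java. It is only an illustration, not the Weka or Lucene API: the class name NaiveBayesSketch, the methods train and classify, and the crude tokenizer are all hypothetical, and it assumes you have already read the content and category fields out of your index as strings. The features are simply word counts per category.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial naive Bayes with add-one smoothing (hypothetical names throughout). */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Crude tokenizer (assumption): lowercase and split on anything that is not a-z. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    /** Learn from one document that already has a category. */
    void train(String content, String category) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String term : tokenize(content)) {
            if (term.isEmpty()) continue; // split() can yield a leading empty token
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(category, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    /** Return the most probable category, or null if nothing has been trained yet. */
    String classify(String content) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // log prior: fraction of training documents in this category
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(category);
            int total = totalTerms.getOrDefault(category, 0);
            for (String term : tokenize(content)) {
                if (term.isEmpty()) continue;
                int count = counts.getOrDefault(term, 0);
                // log likelihood with add-one (Laplace) smoothing over the vocabulary
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("the team won the football match", "sports");
        nb.train("bake the cake with flour and sugar", "cooking");
        System.out.println(nb.classify("a late goal in the football match"));
    }
}
```

    In your case you would call train once for every document that already has a category, then call classify on the content of each uncategorized document. A library such as Weka gives you better tokenization, feature selection, and evaluation than this sketch does.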

    There are some good Java APIs that let you concentrate on code without going too deep into the mathematical theory behind those algorithms (though knowing it is very advantageous). Weka is good. I also came across a couple of books from Manning that handle these tasks well. Here you go:

    Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/

    Chapter 5 (Classification) of Algorithms of the Intelligent Web: http://www.manning.com/marmanis/

    These are absolutely fantastic material (for Java people) on classification, particularly suited to people who don't want to dive into the theory (though it is essential :)) and just want working code quickly.

    Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at these two for your task.
