Text Classification into Categories

Posted by 雨燕双飞 on 2019-12-18 13:34:52

Question


I am working on a text classification problem: I want to classify a collection of words into categories. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using one of them.

Let me explain what I want to implement, using an example.

List of Words:

  1. java
  2. programming
  3. language
  4. c-sharp

List of Categories:

  1. java
  2. c-sharp

Here we train on the set as follows:

  1. java maps to category 1. java
  2. programming maps to category 1. java
  3. programming maps to category 2. c-sharp
  4. language maps to category 1. java
  5. language maps to category 2. c-sharp
  6. c-sharp maps to category 2. c-sharp

Now we have the phrase "The best java programming book". The following words from this phrase match our "List of Words":

  1. java
  2. programming

"programming" is mapped to two categories, "java" and "c-sharp", so it is a common word.

"java" is mapped to the category "java" only.

So the matching category for the phrase is "java".

This is what came to mind. Is this approach sound, and can it be implemented? What are your suggestions? Is there anything I am missing, such as flaws?
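The scheme described above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation. The names `TRAINING`, `build_index`, and `classify` are hypothetical, and weighting each matched word by one over the number of categories it maps to is an assumption added here so that a common word like "programming" counts for less than a word unique to one category.

```python
from collections import defaultdict

# Training pairs taken from the example in the question: (word, category).
TRAINING = [
    ("java", "java"),
    ("programming", "java"),
    ("programming", "c-sharp"),
    ("language", "java"),
    ("language", "c-sharp"),
    ("c-sharp", "c-sharp"),
]


def build_index(pairs):
    """Map each known word to the set of categories it was trained on."""
    index = defaultdict(set)
    for word, category in pairs:
        index[word].add(category)
    return index


def classify(phrase, index):
    """Return the best-scoring category, or None if no word matches.

    Assumption: a word mapped to n categories contributes 1/n to each,
    so words shared by many categories carry less weight.
    """
    scores = defaultdict(float)
    for token in phrase.lower().split():
        categories = index.get(token, ())
        for category in categories:
            scores[category] += 1.0 / len(categories)
    if not scores:
        return None
    return max(scores, key=scores.get)


index = build_index(TRAINING)
result = classify("The best java programming book", index)
print(result)
```

Two gaps in this sketch are worth noting: ties between categories are not resolved in any principled way, and tokenization is a plain whitespace split, so punctuation attached to a word would prevent a match.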


Answer 1:


Of course this can be implemented. If you train a Naive Bayes classifier or linear SVM on the right dataset (titles of Java and C# programming books, I guess), it should learn to associate the term "Java" with Java, "C#" and ".NET" with C#, and "programming" with both. I.e., a Naive Bayes classifier would likely learn a roughly even probability of Java or C# for common terms like "programming" if the dataset is divided evenly.
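Since the question asks for an approach that does not rely on a library, here is a rough sketch of a multinomial Naive Bayes classifier written from scratch. It rests on several assumptions: the four training titles are invented for illustration, the class name `NaiveBayes` and its methods are hypothetical, tokenization is a plain whitespace split, and add-one (Laplace) smoothing is used to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with additive smoothing (a sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_scores(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training titles, made up for this example.
nb = NaiveBayes()
nb.train("java programming language", "java")
nb.train("java virtual machine", "java")
nb.train("c# programming language", "c-sharp")
nb.train(".net and c# development", "c-sharp")

prediction = nb.predict("The best java programming book")
print(prediction)
```

With this toy data, "programming" occurs once in each class and so shifts the score very little, while "java" and "c#" dominate the decision, which is the behavior the answer describes.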




Answer 2:


A dirt simple way of implementing this is using straight-up Lucene (or any text-indexing engine). Create a single Lucene document with all of the "java" examples, and another document with the "c#" examples, and add both to the index. To classify a new document, OR all the terms in the document and execute a query against the index, and grab the category with the highest score.
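The idea can be illustrated without Lucene itself. The sketch below is a pure-Python stand-in, not Lucene's API and not its actual scoring formula: it builds one "document" per category, treats the input as an OR query over its terms, and ranks categories with a simplified TF-IDF-style score. The function names, the example category texts, and the exact weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_index(category_texts):
    """Index one document per category: term frequencies and document frequencies."""
    term_freqs = {
        category: Counter(text.lower().split())
        for category, text in category_texts.items()
    }
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())  # count each term once per document
    return term_freqs, doc_freq


def score(query, term_freqs, doc_freq):
    """Score every category document against an OR query of the query terms."""
    n_docs = len(term_freqs)
    query_terms = set(query.lower().split())
    scores = {}
    for category, tf in term_freqs.items():
        total = 0.0
        for term in query_terms:
            if term in tf:
                # Terms that appear in fewer category documents weigh more.
                idf = math.log(1 + n_docs / doc_freq[term])
                total += math.sqrt(tf[term]) * idf
        scores[category] = total
    return scores


# Hypothetical example text for each category.
term_freqs, doc_freq = build_index({
    "java": "java programming language java virtual machine",
    "c-sharp": "c# programming language .net c# development",
})

scores = score("The best java programming book", term_freqs, doc_freq)
best = max(scores, key=scores.get)
print(best, scores)
```

A real text-indexing engine would also handle analysis (stemming, stop words) and document-length normalization, which this sketch leaves out.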




Answer 3:


If possible, read the section "A Naive Classifier" in the chapter "Document Filtering" of the book "Programming Collective Intelligence". The examples are in Python, but I hope that will not be much trouble for you.



Source: https://stackoverflow.com/questions/8136677/text-classification-into-categories
