Text classification methods: SVM and decision tree?

没有蜡笔的小新 2021-02-05 16:27

I have a training set, and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports.

3 Answers
  • 2021-02-05 16:57
    • Naive Bayes

    Although this is the simplest algorithm and it treats every feature as independent, it works very well in real text classification cases. I would definitely try this algorithm first.

    • KNN

    KNN (k-nearest neighbors) is a classification method, not a clustering one; the clustering algorithm with a similar name is k-means. Be careful not to confuse the concepts of clustering and classification.

    • SVM

    SVM comes in SVC (classification) and SVR (regression) variants, for class prediction and numeric prediction respectively. It sometimes works well, but in my experience it performs poorly on text classification, because it places high demands on good tokenizers (filters), and the dictionary of a dataset always contains dirty tokens. The accuracy I got was really bad.

    • Random Forest (decision tree)

    I've never tried this method for text classification. I think a decision tree needs several key nodes, and it's hard to find "several key tokens" in text classification. Random forests also work poorly on sparse, high-dimensional data.

    FYI

    These are all from my own experience. For your case, there is no better way to decide which method to use than to try each algorithm and see which one fits your data best.
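One way to "try every algorithm" is to cross-validate several candidates on the same features. The following is only a sketch, assuming scikit-learn is installed. The toy corpus, the second category ("politics"), and the choice of TF-IDF features are all made up for illustration and are not from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; replace with your own labeled news documents.
docs = [
    "the striker scored a late goal in the cup final",
    "the team won the match after extra time",
    "the coach praised the goalkeeper after the game",
    "the tennis player won the final set of the match",
    "parliament passed the new budget bill today",
    "the minister announced a tax reform plan",
    "voters went to the polls in the national election",
    "the president signed the trade agreement",
]
labels = ["sports"] * 4 + ["politics"] * 4

# Candidate classifiers, all fed the same TF-IDF features.
candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    # 2-fold cross-validation only because the toy set is tiny;
    # use more folds on real data.
    scores[name] = cross_val_score(model, docs, labels, cv=2).mean()

# Print mean accuracy per classifier, best first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

Scores on eight documents mean nothing by themselves; the point is the loop, which stays the same when you swap in a real training set.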

    Apache Mahout is a great tool for machine learning. It covers three families of algorithms: recommendation, clustering, and classification. You could try this library, but you will need to learn some Hadoop basics first.

    For machine learning experiments, Weka is a software toolkit that integrates many algorithms.

  • 2021-02-05 17:01

    Linear SVMs are among the top algorithms for text classification problems (along with Logistic Regression). Decision Trees suffer badly in such high-dimensional feature spaces.

    The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
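To give a feel for how simple Pegasos is, here is a rough pure-Python sketch of its basic update (stochastic sub-gradient steps with step size 1/(λt)). It is a simplified variant without a bias term or the optional projection step. The toy corpus, the +1/-1 labels, and the helper names (`build_vocab`, `featurize`, `pegasos_train`, `predict`) are all hypothetical.

```python
import random

# Hypothetical toy corpus: +1 = sports, -1 = not sports.
train = [
    ("striker scored late goal cup final", +1),
    ("team won match extra time", +1),
    ("goalkeeper saved penalty match", +1),
    ("coach praised team goal", +1),
    ("parliament passed budget bill", -1),
    ("minister announced tax reform", -1),
    ("voters election polls parliament", -1),
    ("president signed trade agreement", -1),
]

def build_vocab(texts):
    """Assign a column index to every distinct token."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def featurize(text, vocab):
    """Dense bag-of-words count vector; unknown tokens are ignored."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pegasos_train(X, y, lam=0.01, n_iters=2000, seed=0):
    """Train a linear SVM with Pegasos-style updates; labels must be +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(X))        # pick one random training example
        eta = 1.0 / (lam * t)            # decaying step size
        margin = y[i] * dot(w, X[i])     # margin under the current weights
        w = [(1.0 - eta * lam) * wi for wi in w]  # shrink (regularizer)
        if margin < 1.0:                 # hinge loss active: move toward y*x
            w = [wi + eta * y[i] * xi for wi, xi in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if dot(w, x) > 0 else -1

vocab = build_vocab(text for text, _ in train)
X = [featurize(text, vocab) for text, _ in train]
y = [label for _, label in train]
w = pegasos_train(X, y)

for text in ["team scored goal", "parliament budget election"]:
    print(text, "->", predict(w, featurize(text, vocab)))
```

For real data you would want sparse vectors rather than dense lists. If you use scikit-learn, `SGDClassifier(loss="hinge")` trains a linear SVM with a similar stochastic gradient approach.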

    EDIT: Multinomial Naive Bayes also works well on text data, though not usually as well as Linear SVMs. kNN can work okay, but it's already a slow algorithm and doesn't ever top the accuracy charts on text problems.

  • 2021-02-05 17:04

    If you are familiar with Python, you may consider NLTK and scikit-learn. The former is dedicated to NLP, while the latter is a more comprehensive machine learning package (but it has a great inventory of text processing modules). Both are open source and have great community support on SO.
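As a starting point with scikit-learn, training on a labeled set and then classifying unseen documents can look roughly like this. It is a minimal sketch: the documents, the "politics" label, and the TF-IDF plus `LinearSVC` combination are illustrative assumptions, not something given in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training set; replace with your own labeled news documents.
train_docs = [
    "the striker scored a late goal in the cup final",
    "the team won the match after extra time",
    "the goalkeeper saved a penalty in the second half",
    "parliament passed the new budget bill today",
    "the minister announced a tax reform plan",
    "voters went to the polls in the national election",
]
train_labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

# The vectorizer turns raw text into sparse TF-IDF features;
# the linear SVM is trained on those features.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Classify documents that were not in the training set.
new_docs = [
    "the goalkeeper saved a penalty in the match",
    "the minister proposed a new tax bill",
]
predicted = list(model.predict(new_docs))
for doc, label in zip(new_docs, predicted):
    print(label, "<-", doc)
```

Because the vectorizer and the classifier sit in one pipeline, the same preprocessing is applied automatically to any new document passed to `predict`.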
