How to do text classification with label probabilities?

问题

I'm trying to solve a text classification problem for academic purpose. I need to classify the tweets into labels like "cloud" ,"cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in training data has probabilities against all the label. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has 21% chance for being hot and 79% chance for wind. I have worked on the classification problems which predicts whether its wind or hot or others. But in this problem, each training data has probabilities against all the labels. I have previously used mahout naive bayes classifier which take a specific label for a given text to build model. How to convert these input probabilities for various labels as input to any classifier?

回答1:

In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.

There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating parameters in your models, instead of each word getting a count of one for the class to which the document belongs, it gets a count of probability. Thus documents with high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to when learning a mixture of multinomials model using EM, where the probabilities you have are identical to the membership/indicator variables for your instances.

Alternatively, if your classifier were a neural net with softmax output, instead of the target output being a vector with a single [1] and lots of zeros, the target output becomes the probability vector you're supplied with.

I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.

回答2:

If you want an off the shelf solution, you could use a learner the supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features, and with label 1, ..., k, and assign weights p_1, ..., p_k respectively.

Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.

来源：https://stackoverflow.com/questions/19221289/how-to-do-text-classification-with-label-probabilities

标签

machine-learning

classification

mahout

text-mining