Question
I'm trying to get the Rocchio algorithm for relevance feedback to work. I have a query and a few documents marked as positives and negatives. For example, I have 60 positives and 337 negatives. I want to train my model (in this case, adjust the query) using part of this dataset and test it on the other part. But with such an imbalanced dataset, I'm not sure how many negatives and how many positives to put in the training set.
Another problem is that, depending on the positives/negatives proportion in the test dataset, I get misleading Precision, Recall and F1-score results. Having 49 positives and 17 negatives in the test dataset gives me Precision=0.742, Recall=1.000 and F1=0.852, with TP=49, FP=17, TN=0, FN=0.
The distribution of the positives/negatives proportion for other queries doesn't give me any hint about which proportion to choose for my model.
So what I'm asking for is some advice on working with imbalanced datasets to get correct results.
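To see where these figures come from, the counts quoted above can be plugged into the standard definitions. This is only a minimal sketch; the function name `precision_recall_f1` is made up for illustration. With TN=0 and FN=0, every test document was labelled positive, so the precision is simply the share of positives in the test set (49 of 66).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Hypothetical helper for illustration; zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the question: TP=49, FP=17, TN=0, FN=0.
p, r, f1 = precision_recall_f1(tp=49, fp=17, fn=0)
print(f"Precision={p:.3f} Recall={r:.3f} F1={f1:.3f}")
# Precision=0.742 Recall=1.000 F1=0.852
```

Because none of these three metrics uses TN, they look good here even though no negative was recognized.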
Thanks in advance, sorry for such a noob(-ish?) question :-)
Answer 1:
First of all, I think that your algorithm will have a hard time generalizing from such a small number of examples (this also depends on the number of features, of course).
Secondly, I don't think that it is a very good idea to work with an imbalanced dataset. It seems that your algorithm hasn't learned anything, since its output is always "positive". This means that if your dataset were balanced, you would get 50% accuracy. Not too good... If you cannot find a larger dataset, I would suggest that you split yours as follows:
- Training set (45 positives / 45 negatives)
- Test set (15 positives / 15 negatives)
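One way to build such a split is to undersample the negatives at random, as in the rough sketch below. The helper `balanced_split`, the placeholder document IDs and the fixed seed are assumptions made for illustration, not part of the original setup.

```python
import random


def balanced_split(positives, negatives, n_train, n_test, seed=0):
    """Draw n_train + n_test items per class at random, then split them.

    Hypothetical helper: returns (train, test) lists of (doc, label)
    pairs, with label 1 for positives and 0 for negatives.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = rng.sample(positives, n_train + n_test)
    neg = rng.sample(negatives, n_train + n_test)  # undersample negatives
    train = [(d, 1) for d in pos[:n_train]] + [(d, 0) for d in neg[:n_train]]
    test = [(d, 1) for d in pos[n_train:]] + [(d, 0) for d in neg[n_train:]]
    return train, test


# Placeholder IDs standing in for the 60 positive and 337 negative documents.
positives = [f"pos_{i}" for i in range(60)]
negatives = [f"neg_{i}" for i in range(337)]

train, test = balanced_split(positives, negatives, n_train=45, n_test=15)
print(len(train), len(test))  # 90 30
```

The unused negatives are simply discarded here; repeating the split with different seeds would give a rough idea of how much the results vary on so little data.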
Anyway, I am still a student, so this is just what I think; it would be good if a more experienced user could confirm or refute it.
Hope it helps!
Source: https://stackoverflow.com/questions/10734401/positives-negatives-proportion-in-train-set