Positives/negatives proportion in train set

半城伤御伤魂 posted on 2019-12-08 07:59:57

Question


I'm trying to get the Rocchio algorithm for relevance feedback to work. I have a query and a set of documents marked as positive or negative. For example, I have 60 positives and 337 negatives. I want to train my model (in this case, adjust the query) on part of this dataset and test it on the rest. But with a dataset this imbalanced, I'm not sure how many negatives and how many positives to put in the training set.
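For reference, a minimal sketch of the Rocchio update being discussed: the adjusted query is the original query moved toward the centroid of the positive documents and away from the centroid of the negative ones. The function name `rocchio`, the toy vectors, the weights `alpha=1.0`, `beta=0.75`, `gamma=0.15`, and the clipping of negative term weights to zero are all assumptions chosen for illustration, not values taken from this thread.

```python
def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an adjusted query vector (plain lists of floats).

    alpha, beta, gamma are illustrative weights (assumed, not from the thread).
    """
    def centroid(docs):
        # Mean vector of a document set; a zero vector if the set is empty.
        if not docs:
            return [0.0] * len(query)
        return [sum(column) / len(docs) for column in zip(*docs)]

    pos_centroid = centroid(positives)
    neg_centroid = centroid(negatives)
    updated = [
        alpha * q + beta * p - gamma * n
        for q, p, n in zip(query, pos_centroid, neg_centroid)
    ]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in updated]


# Toy 3-term vocabulary (hypothetical data).
query = [1.0, 0.0, 0.0]
positives = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
negatives = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, positives, negatives)
print(new_query)
```

Because each class is averaged into a centroid before weighting, the number of documents per class in this formulation affects how well each centroid is estimated rather than how strongly that class pulls on the query.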

Another problem is that, depending on the positives/negatives proportion in the test dataset, I get misleading Precision, Recall and F1-score results. With 49 positives and 17 negatives in the test dataset I get Precision=0.742, Recall=1.000 and F1=0.852, with TP=49, FP=17, TN=0, FN=0.
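The figures above can be reproduced directly from the confusion counts. The sketch below uses the standard definitions of precision, recall and F1; the helper name `precision_recall_f1` is made up for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported in the question: TP=49, FP=17, FN=0 (TN=0 is not used here).
p, r, f1 = precision_recall_f1(tp=49, fp=17, fn=0)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.742 1.0 0.852
```

With TN=0 and FN=0, every test document was labelled positive, so recall is 1 by construction and precision is simply the share of positives in the test set (49/66). That is why these scores follow the class ratio of the test set.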

The distribution of positives and negatives for other queries doesn't give me any hint about which proportion to choose for my model.

So what I'm asking for is some advice on working with imbalanced datasets to get correct results.

Thanks in advance, sorry for such a noob(-ish?) question :-)


Answer 1:


First of all, I think your algorithm will have a hard time generalizing from such a small number of examples (this also depends on the number of features, of course).

Secondly, I don't think it is a very good idea to work with an imbalanced dataset. It seems that your algorithm hasn't learned anything, since its output is always "positive" (TN=0 and FN=0). This means that on a balanced test set you would get 50% accuracy. Not too good... If you cannot find a larger dataset, I would suggest splitting yours as follows:

  • Training set (45 positives / 45 negatives)
  • Test set (15 positives / 15 negatives)
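A minimal sketch of how such a balanced split could be drawn by randomly undersampling the negatives. The function name `balanced_split`, the fixed seed, and the placeholder document IDs are assumptions for illustration; only the class sizes (60 positives, 337 negatives) and the 45/15 split come from this thread.

```python
import random


def balanced_split(positives, negatives, n_train_per_class, n_test_per_class, seed=0):
    """Draw disjoint, class-balanced train and test sets by random sampling."""
    needed = n_train_per_class + n_test_per_class
    if len(positives) < needed or len(negatives) < needed:
        raise ValueError("not enough documents in one of the classes")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = rng.sample(positives, needed)
    neg = rng.sample(negatives, needed)  # undersample the majority class

    # Label 1 = positive, 0 = negative.
    train = [(d, 1) for d in pos[:n_train_per_class]] + [(d, 0) for d in neg[:n_train_per_class]]
    test = [(d, 1) for d in pos[n_train_per_class:]] + [(d, 0) for d in neg[n_train_per_class:]]
    return train, test


# Placeholder document IDs standing in for the real documents.
positives = [f"pos_{i}" for i in range(60)]
negatives = [f"neg_{i}" for i in range(337)]
train, test = balanced_split(positives, negatives, n_train_per_class=45, n_test_per_class=15)
print(len(train), len(test))  # 90 30
```

Note that this discards 277 of the 337 negatives. Repeating the split with several different seeds and averaging the scores is one way to make use of more of them.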

Anyway, I am still a student, so this is just what I think. It would be good if a more experienced user could confirm or refute it.

Hope it helps!



Source: https://stackoverflow.com/questions/10734401/positives-negatives-proportion-in-train-set
