Why did PCA reduce the performance of Logistic Regression?

Submitted by 牧云@^-^@ on 2019-12-30 07:18:08

Question


I ran logistic regression on a binary classification problem with data of dimensions 50000 x 370 and got an accuracy of about 90%. But when I ran PCA + logistic regression on the data, my accuracy dropped to 10%. I was shocked to see this result. Can anybody explain what could have gone wrong?


Answer 1:


There is no guarantee that PCA will ever help, or that it will not harm the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, so anything can happen: if the removed information was redundant, you will probably get better scores; if it was an important part of the problem, you will get worse ones. Even without dropping dimensions, just "rotating" the input space through PCA can both benefit and harm the process. One must remember that PCA is just a heuristic when it comes to supervised learning. The only guarantee of PCA is that each consecutive dimension will explain less and less variance, and that it is the best affine transformation in terms of explaining variance in the first K dimensions. That's all.

This can be completely unrelated to the actual problem, as PCA does not consider labels at all. Given any dataset, PCA will transform it in a way that depends only on the positions of the points. So for some labelings (those consistent with the general shape of the data) it might help, but for many others (more complex patterns of labels) it will destroy relations that were previously detectable. Furthermore, since PCA changes some of the scalings, you might need different hyperparameters for your classifier, such as the regularization strength for LR.
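To make the "PCA does not consider labels" point concrete, here is a minimal NumPy sketch on synthetic data. Everything in it is an assumption made for illustration (the two-feature dataset, the helper names `pca_components` and `sign_accuracy`, and the use of a simple threshold in place of logistic regression); it is not the asker's setup. One feature has high variance but is unrelated to the label, the other has low variance and fully determines it.

```python
import numpy as np

def pca_components(X):
    """Return principal axes (as rows) and explained-variance ratios via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    return vt, var / var.sum()

def sign_accuracy(z, y):
    """Accuracy of thresholding a 1-D projection at 0.

    The sign of a principal axis is arbitrary, so take the better orientation.
    """
    acc = np.mean((z > 0) == y)
    return max(acc, 1.0 - acc)

# Synthetic data (illustrative assumption, not from the question).
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(scale=10.0, size=n)   # high variance, unrelated to the label
signal = rng.normal(scale=1.0, size=n)   # low variance, determines the label
X = np.column_stack([noise, signal])
y = signal > 0

axes, ratio = pca_components(X)
Z = (X - X.mean(axis=0)) @ axes.T        # data expressed in principal components

acc_first = sign_accuracy(Z[:, 0], y)    # the component PCA would keep with K=1
acc_second = sign_accuracy(Z[:, 1], y)   # the component PCA would drop with K=1

print("explained variance ratio:", ratio)
print("accuracy using only component 1:", acc_first)
print("accuracy using only component 2:", acc_second)
```

In this construction the first component explains almost all of the variance yet separates the classes only at roughly chance level, while the discarded second component separates them almost perfectly. Keeping only the top component here would throw away exactly the information the classifier needs.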

Now, getting back to your problem: I would say that in your case the problem is ... a bug in your code. In a binary problem, accuracy cannot drop significantly below 50%. An accuracy of 10% means that using the opposite of your classifier would give 90% (just answering "false" when it says "true" and the other way around). So even though PCA might not help (or might even harm, as described), in your case it is certainly an error in your code.
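The arithmetic behind the "opposite classifier" argument can be checked with a few lines of plain Python. The labels and predictions below are made up for illustration, and `accuracy` is a hypothetical helper, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy binary labels and a classifier that is right on only 1 of 10 samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Inverting every prediction turns each error into a hit and vice versa.
y_flipped = [1 - p for p in y_pred]

acc = accuracy(y_true, y_pred)
acc_flipped = accuracy(y_true, y_flipped)

print(acc, acc_flipped)
```

For binary labels the two accuracies always sum to 1, so a model scoring 10% still carries as much information about the labels as one scoring 90%. That is why such a score suggests a mistake somewhere in the pipeline rather than information lost to PCA.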



Source: https://stackoverflow.com/questions/36668768/why-did-pca-reduced-the-performance-of-logistic-regression
