What is the difference between Principal Component Analysis (PCA) and Feature Selection in Machine Learning? Is PCA a means of feature selection?
Just to add to the very good answers above. The difference is that PCA tries to reduce dimensionality by exploring how one feature of the data is expressed in terms of the other features (linear dependency). Feature selection, instead, takes the target into consideration: it ranks your input variables by how useful they are for predicting the target value. This is true for univariate feature selection. Multivariate feature selection can also do something that resembles PCA, in the sense that it discards some of the features in the input, but don't take this analogy too far. A quick sketch contrasting the two follows.
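A minimal sketch, assuming scikit-learn is available and using a synthetic dataset purely for illustration: univariate selection (here SelectKBest with f_classif, one of several possible scorers) scores each feature against the target, while PCA never looks at the target at all.

```python
# Minimal sketch: univariate feature selection uses the target, PCA does not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Univariate selection: scores each feature by how well it predicts y.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("Feature scores w.r.t. target:", np.round(selector.scores_, 1))
print("Selected feature indices:", selector.get_support(indices=True))

# PCA: builds new axes from X alone; y is never used.
pca = PCA(n_components=3).fit(X)
print("Variance explained by top 3 PCs:", pca.explained_variance_ratio_)
```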
Just to add to the answer by @Roger Rowland. In the context of supervised learning (classification, regression) I like to think of PCA as a "feature transformer" rather than a feature selector.
PCA is based on extracting the axes on which data shows the highest variability. Although it “spreads out” data in the new basis, and can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a supervised problem.
Put more simply, there is no guarantee at all that your top principal components are the most informative when it comes to predicting the dependent variable (e.g. class label).
This paper is a useful source. Another relevant Cross Validated link is here.
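To make the point concrete, here is a small sketch on hypothetical synthetic data (not taken from the paper above): the class label depends only on a low-variance direction, so the first principal component, despite capturing most of the variance, is nearly useless for prediction.

```python
# Sketch: the direction of highest variance need not be the discriminative one.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
x_high_var = rng.normal(scale=10.0, size=n)   # large variance, no class information
x_low_var = rng.normal(scale=1.0, size=n)     # small variance, carries the label
y = (x_low_var > 0).astype(int)
X = np.column_stack([x_high_var, x_low_var])

Z = PCA(n_components=2).fit_transform(X)
for k in range(2):
    acc = cross_val_score(LogisticRegression(), Z[:, [k]], y, cv=5).mean()
    print(f"Accuracy using PC{k + 1} only: {acc:.2f}")
# Typically PC1 (the high-variance axis) predicts at chance level; PC2 does much better.
```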
PCA is a way of finding out which features are important for best describing the variance in a data set. It's most often used for reducing the dimensionality of a large data set so that it becomes more practical to apply machine learning where the original data are inherently high dimensional (e.g. image recognition).
PCA has limitations, though: it relies on linear relationships between feature elements, and it's often unclear what those relationships are before you start. Because it also "hides" feature elements that contribute little to the variance in the data, it can sometimes eradicate a small but significant differentiator that would affect the performance of a machine learning model.
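As a rough illustration of the typical dimensionality-reduction workflow, here is a sketch using scikit-learn's digits dataset as a stand-in for inherently high-dimensional data (64 pixels per image); the 0.90 variance threshold is an arbitrary choice for the example.

```python
# Sketch: PCA as dimensionality reduction before a classifier (digits images, 64 pixels each).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Keep enough components to explain 90% of the variance, then classify.
model = make_pipeline(PCA(n_components=0.90), LogisticRegression(max_iter=2000))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```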
You can do feature selection with PCA.
Principal component analysis (PCA) is a technique that
"uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components."
The question that PCA fundamentally helps us to answer is this: which of these M parameters explain a significant amount of the variation contained within the data set? PCA essentially helps to apply an 80-20 rule: can a small subset of parameters (say 20%) explain 80% or more of the variation in the data?
(see here)
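A minimal sketch of that 80% check, assuming scikit-learn and using the wine dataset only as an example: the cumulative explained variance ratio tells you how many components you need.

```python
# Sketch: how many principal components explain at least 80% of the variance?
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # scaling first; see the caveat below

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_80 = np.argmax(cumulative >= 0.80) + 1
print(f"{n_80} of {X.shape[1]} components explain "
      f"{cumulative[n_80 - 1]:.0%} of the variance")
```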
But it has some shortcomings: it is sensitive to scale, and it gives more weight to features with a higher order of magnitude. Data normalization is not always the solution, as explained here:
http://www.simafore.com/blog/bid/105347/Feature-selection-with-mutual-information-Part-2-PCA-disadvantages
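Here is a small demonstration of the scale sensitivity on made-up data (the income/age features are hypothetical): a feature measured in large units dominates the first component unless you standardize, though, as the link above argues, standardizing is not always the right fix either.

```python
# Sketch: PCA is scale-sensitive; a feature in large units dominates the first component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=300)   # dollars: huge numeric range
age = rng.normal(40, 10, size=300)              # years: small numeric range
X = np.column_stack([income, age])

raw_pc1 = PCA(n_components=1).fit(X).components_[0]
scaled_pc1 = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw data:    ", np.round(raw_pc1, 3))   # ~[1, 0]: income dominates
print("PC1 loadings, standardized:", np.round(scaled_pc1, 3))
```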
There are other ways to do feature selection:
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimises the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.
(see here)
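To make the categories a bit more tangible, here is a sketch contrasting a filter (univariate scoring, no model in the loop) with a wrapper (recursive feature elimination around a classifier); the dataset, k=5, and the choice of logistic regression are arbitrary, and exhaustive search appears only as a subset count to underline why it is intractable.

```python
# Sketch: a filter method vs. a wrapper method for feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Filter: score each feature against y, keep the top 5 (no model involved).
filter_idx = SelectKBest(mutual_info_classif, k=5).fit(X, y).get_support(indices=True)

# Wrapper: repeatedly fit a model and drop the weakest features (RFE).
wrapper = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
wrapper_idx = wrapper.get_support(indices=True)

print("Filter picked: ", filter_idx)
print("Wrapper picked:", wrapper_idx)
print("Exhaustive search would test", 2 ** X.shape[1] - 1, "subsets")  # intractable
```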
In some fields, feature extraction can suggest specific goals: in image processing, you may want to perform blob, edge or ridge detection.
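For completeness, a tiny sketch of such hand-crafted image feature extraction, assuming scikit-image is installed; the sample image and the sigma/threshold values are arbitrary choices for illustration.

```python
# Sketch: classic image feature extraction: edges and blobs (requires scikit-image).
from skimage import data, feature

image = data.coins()                       # sample grayscale image shipped with skimage
edges = feature.canny(image, sigma=2)      # boolean edge map
blobs = feature.blob_log(image, max_sigma=30, threshold=0.1)  # rows: (y, x, sigma)

print("Edge pixels found:", int(edges.sum()))
print("Blobs found:", len(blobs))
```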