How to use sklearn's IncrementalPCA partial_fit

难免孤独 2021-01-16 07:41

I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice.

1 Answer

    南笙 2021-01-16 08:22

    As you correctly guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
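
    For reference, a minimal sketch of that fitting loop, assuming features is your on-disk data (for example an h5py dataset) and num_rows is its number of rows; the chunk size here is only an example value:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    chunk_size = 1000                       # example value; tune it to your memory budget
    ipca = IncrementalPCA(n_components=2)   # 2 components, as in the question

    for start in range(0, num_rows, chunk_size):
        # Each call updates the running PCA estimate with one chunk; every
        # chunk must contain at least n_components rows.
        chunk = np.asarray(features[start:start + chunk_size])
        ipca.partial_fit(chunk)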

    What you have to do now is actually transform the data, chunk by chunk:

    out = my_new_features_dataset  # pre-allocated output, shape (N, 2)
    for start in range(0, num_rows, chunk_size):
        # Transform one chunk at a time; stepping by chunk_size also covers
        # the final, possibly smaller, chunk.
        out[start:start + chunk_size] = ipca.transform(features[start:start + chunk_size])
    

    And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset.
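
    For instance, a minimal sketch with h5py; the file and dataset names here are made up for illustration:

    import h5py

    with h5py.File('transformed.h5', 'w') as f:   # hypothetical output file
        # Allocate the on-disk output once, then fill it chunk by chunk so the
        # transformed features never have to fit in memory all at once.
        out = f.create_dataset('pca_features', shape=(num_rows, 2), dtype='float64')
        for start in range(0, num_rows, chunk_size):
            out[start:start + chunk_size] = ipca.transform(features[start:start + chunk_size])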

    Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for your affordable information-loss threshold.
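
    The square-root heuristic is a one-liner, assuming features exposes a NumPy-style .shape:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    # Heuristic from above: keep roughly sqrt(n_features) components.
    n_components = int(np.sqrt(features.shape[1]))
    ipca = IncrementalPCA(n_components=n_components)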


    Edit: as for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the percentage of the variance of your original data explained by the i-th new component.

    You can follow the procedure in this answer to extract how much information is preserved by the first n components:

    >>> print(ipca.explained_variance_ratio_.cumsum())
    [ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]
    

    Note: the numbers are fictitious, taken from the answer above and assuming that you have reduced IPCA to 5 components. The i-th number indicates how much of the original data is explained by the first [0, i] components, as it is the cumulative sum of the explained variance ratios.

    Thus, what is usually done is to fit your PCA to the same number of components as your original data:

    ipca = IncrementalPCA(n_components=features.shape[1])
    

    Then, after training on your whole data (with iteration + partial_fit), you can plot explained_variance_ratio_.cumsum() and choose how much information you want to lose. Or do it automatically:

    k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1
    

    The above takes the first index of the cumsum array where the value exceeds 0.9 and adds 1, giving the number of PCA components that preserve at least 90% of the variance of the original data.
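
    If you would rather pick the threshold by eye, a minimal plotting sketch (assuming matplotlib is available) could be:

    import matplotlib.pyplot as plt

    # Cumulative explained variance vs. number of components.
    cumulative = ipca.explained_variance_ratio_.cumsum()
    plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
    plt.axhline(0.9, linestyle='--')   # the 90% threshold used above
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')
    plt.show()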

    Then you can tweak the transformation to reflect it:

    cs = chunk_size
    out = my_new_features_dataset  # pre-allocated output, shape (N, k)
    for start in range(0, num_rows, cs):
        # Keep only the first k components of each transformed chunk.
        out[start:start + cs] = ipca.transform(features[start:start + cs])[:, :k]
    

    NOTE the slicing to :k, which selects only the first k components and ignores the rest.
