Python PCA on Matrix too large to fit into memory

臣服心动 2020-12-20 18:10

I have a CSV that is 100,000 rows × 27,000 columns, and I am trying to run PCA on it to produce a 100,000 × 300 matrix. The CSV is 9 GB. Here is currently what

2 Answers
  • 2020-12-20 18:46

    Classical PCA needs the covariance matrix of the features, which here would be 27,000 x 27,000. Stored as doubles that is roughly 5.8 GB, on top of the ~21.6 GB needed to hold the 100,000 x 27,000 data matrix itself. I would be willing to bet your MacBook does not have that much free RAM.

    The PCA transformation matrix fitted on a reasonably sized random subset of the rows is likely to be nearly the same as one fitted on the full data.
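    A minimal sketch of that subset idea, assuming the CSV has a header row and every column is a feature (the file name, subset size, and chunk size below are placeholders):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    csv_path = "data.csv"          # hypothetical file name
    n_rows = 100_000
    sample_size = 10_000           # rows used to fit the PCA

    # Read only a random subset of the data rows (row 0 of the file is the header).
    rng = np.random.default_rng(0)
    keep = set(rng.choice(n_rows, size=sample_size, replace=False))
    sample = pd.read_csv(csv_path, skiprows=lambda i: i != 0 and (i - 1) not in keep)

    # Fit the projection on the subset only.
    pca = PCA(n_components=300).fit(sample)

    # Stream the full file through the fitted transformation in chunks.
    parts = [pca.transform(chunk) for chunk in pd.read_csv(csv_path, chunksize=5_000)]
    X_reduced = np.vstack(parts)   # 100,000 x 300

    Only the subset (a few GB) is ever decomposed; the full file is just streamed through transform.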

  • 2020-12-20 19:00

    Try dividing your data into batches, or loading it in chunks, and fit an IncrementalPCA with its partial_fit method on each batch:

    from sklearn.decomposition import IncrementalPCA
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 5 * 25000   # rows per batch; keep this well below the total row count
    dimensions = 300

    # First pass: fit the IncrementalPCA batch by batch.
    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)
    for chunk in reader:
        y = chunk.pop("Y")   # drop the label column before fitting
        sklearn_pca.partial_fit(chunk)

    # Computed mean per feature
    mean = sklearn_pca.mean_
    # and stddev
    stddev = np.sqrt(sklearn_pca.var_)

    # Second pass: transform the data batch by batch.
    Xtransformed = None
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        y = chunk.pop("Y")
        Xchunk = sklearn_pca.transform(chunk)
        if Xtransformed is None:
            Xtransformed = Xchunk
        else:
            Xtransformed = np.vstack((Xtransformed, Xchunk))
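
    One note on the second loop (a sketch, reusing the variables from the snippet above): np.vstack re-copies the growing result on every iteration, so it is cheaper to collect the transformed chunks in a list, stack them once, and optionally save the result to disk:

    chunks_out = []
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        y = chunk.pop("Y")
        chunks_out.append(sklearn_pca.transform(chunk))
    Xtransformed = np.vstack(chunks_out)        # 100,000 x 300, ~240 MB in float64
    np.save("X_transformed.npy", Xtransformed)  # hypothetical output file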
    

    Useful link
