Incremental PCA on big data


I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like the PCA and RandomizedPCA before. My problem is that the matrix I am

1 Answer
  • 2020-12-09 12:16

    Your program is probably failing while trying to load the entire dataset into RAM. 4 bytes per float32 × 1,000,000 × 1,000 ≈ 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of that size alone:

    >>> import numpy as np
    >>> np.zeros((1000000, 1000), dtype=np.float32)
    

    If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.

    With h5py datasets we just have to avoid passing the entire dataset to our methods and instead pass slices of the dataset, one at a time.
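
    For example, here is a minimal sketch of what that lazy slicing looks like (assuming a hypothetical file 'data.h5' that already contains a dataset named 'data'); indexing the dataset object reads only the requested rows into memory:

    import h5py

    with h5py.File('data.h5', 'r') as h5f:
        dset = h5f['data']          # lazy handle, nothing is loaded yet
        first_rows = dset[0:1000]   # only these 1000 rows are read into RAM as a NumPy array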

    As I don't have your data, let me start by creating a random dataset of the same size:

    import h5py
    import numpy as np

    # create a 1,000,000 × 1,000 float32 dataset on disk, filling it in 1000-row chunks
    h5 = h5py.File('rand-1Mx1K.h5', 'w')
    h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
    for i in range(1000):
        h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
    h5.close()
    

    It creates a nice 3.8 GiB file.
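    If you want to double-check (this step is not in the original answer), the raw data is 1,000,000 × 1,000 × 4 bytes ≈ 3.7 GiB, and the file on disk should be close to that plus a bit of HDF5 overhead:

    import os
    # raw float32 payload is ~3.7 GiB; HDF5 adds a little metadata on top
    print(os.path.getsize('rand-1Mx1K.h5') / 2**30, 'GiB')
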

    Now, if we are on Linux, we can limit how much memory is available to our program:

    $ bash                          # start a child shell so the limit is easy to undo
    $ ulimit -m $((1024*1024*2))    # cap resident memory at 2 GiB (the value is in KiB)
    $ ulimit -m
    2097152
    

    Now, if we try to run your code, we'll get the MemoryError. (Press Ctrl-D later to quit the new bash session and lift the limit.)
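
    Since the original code isn't shown, here is a rough sketch (an assumption, not the actual code) of the kind of call that triggers it: materializing the whole dataset with [:] builds a ~3.7 GiB array before fitting even starts.

    import h5py
    from sklearn.decomposition import IncrementalPCA

    # hypothetical reconstruction of the failing approach: h5['data'][:] reads the
    # entire (1000000, 1000) float32 array (~3.7 GiB) into RAM in one go
    h5 = h5py.File('rand-1Mx1K.h5', 'r')
    ipca = IncrementalPCA(n_components=10, batch_size=16)
    ipca.fit(h5['data'][:])   # expected to blow past the 2 GiB limit
    h5.close()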

    Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

    import h5py
    import numpy as np
    from sklearn.decomposition import IncrementalPCA
    
    h5 = h5py.File('rand-1Mx1K.h5', 'r')
    data = h5['data'] # OK: the dataset is not fetched into memory yet
    
    n = data.shape[0] # how many rows we have in the dataset
    chunk_size = 1000 # how many rows we feed to IPCA at a time; should divide n evenly
    ipca = IncrementalPCA(n_components=10, batch_size=16)
    
    for i in range(0, n//chunk_size):
        ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    

    It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
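
    As a possible follow-up (not part of the original answer), the fitted model can also produce the reduced representation chunk by chunk with .transform(), writing the result into a new HDF5 file so it never has to sit fully in RAM either. A sketch, reusing the imports and the variables n, chunk_size, data and ipca from the script above:

    out = h5py.File('rand-1Mx1K-reduced.h5', 'w')
    reduced = out.create_dataset('data', shape=(n, 10), dtype=np.float32)
    for i in range(n // chunk_size):
        sl = slice(i * chunk_size, (i + 1) * chunk_size)
        reduced[sl] = ipca.transform(data[sl])  # each 1000×1000 chunk becomes 1000×10
    out.close()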
