I currently want to calculate all-pair document similarity using cosine similarity and Tfidf features in python. My basic approach is the following:
from sklearn
What you could do is slice a row and a column of X, multiply those and save the resulting row to a file. Then move to the next row and column.
It is still the same amount of calculation work but you wouldn't run out of memory.
Using multiprocessing.Pool.map()
or multiprocessing.Pool.map_async()
you migt be able to speed it up, provided you use numpy.memmap()
to read the matrix in the mapped function. And you would probably have to write each of the calculated rows to a separate file to merge them later. If you were to return the row from the mapped function it would have to be transferred back to the original process. That would take a lot of memory and IPC bandwidth.