Basically, what is the best way to go about storing and using dense matrices in Python?
I have a project that generates similarity metrics between every pair of items in an array.
If you have N objects, kept in a list L, and you wish to store the similarity between each object and every other object, that's O(N**2) similarities. Under the common conditions that similarity(A, B) == similarity(B, A) and similarity(A, A) == 0, all you need is a triangular array S of similarities. The number of elements in that array will be N*(N-1)//2. You should be able to use an array.array for this purpose. Keeping each similarity as a float will take only 8 bytes. If you can represent your similarity as an integer in range(256), you can use an unsigned byte as the array.array element.
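For concreteness, here's a minimal sketch of that triangular layout; the `tri_index` helper and the example values are my own placeholders, not anything from your project:

```python
from array import array

def tri_index(i, j, n):
    # Flat index for the pair (i, j), i != j, in an upper triangle
    # stored without the diagonal (n*(n-1)//2 elements in total).
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 4                                         # tiny n just to illustrate
S = array('d', [0.0]) * (n * (n - 1) // 2)    # 'd' = C double, 8 bytes per element
S[tri_index(1, 3, n)] = 0.75                  # e.g. S[...] = similarity(L[1], L[3])
print(len(S), S[tri_index(3, 1, n)])          # -> 6 0.75
```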
So that's about 8000 * 8000 / 2 * 8, i.e. about 256 MB. Using only a byte per similarity means only 32 MB. You could avoid the slow S[i*N - i*(i+1)//2 + (j - i - 1)] index calculation of the triangle thingie by simulating a square array instead, using S[i*N + j]; memory would double (512 MB for float, 64 MB for byte).
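And a sketch of the square-array variant with byte-sized similarities (the value 200 is just an arbitrary quantised similarity; note this really does allocate the full 64 MB):

```python
from array import array

n = 8000
S = array('B', [0]) * (n * n)        # 'B' = unsigned byte, similarities quantised into range(256)
i, j = 123, 4567
S[i * n + j] = S[j * n + i] = 200    # plain i*n + j indexing; store both halves
print(len(S) * S.itemsize)           # -> 64000000 bytes, i.e. about 64 MB
```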
If the above doesn't suit you, then perhaps you could explain """Each item [in which container?] is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class.""" and """I don't want to store numbers, I want to store reference to classes""". Even after replacing "class(es)" by "object(s)", I'm struggling to understand what you mean.