In light of this article, I am wondering what people's experiences are with storing massive datasets (say, >10,000,000 objects) in-memory using arrays to store data fields instead of instantiating millions of objects and racking up the memory overhead...
I guess there are several ways to approach this, and you are indeed onto a possible way to limit the data in memory. However, I'm not sure that shaving even 24(?) bytes off your structure is going to do you a whole lot of good. Your structure is around 79 bytes for a 15-character string (8 + 8 + 4 + 24? + 4 + 1 + 2 × character length), so your total gain is at best about 25%. That doesn't seem very useful, since you'd have to be in a position where 10 million × 80 bytes fits in memory but 10 million × 100 bytes does not. That would mean you're designing a solution that lives on the edge of disaster: a few too many large strings, or too many records, or some other program hogging memory, and your machine is out of memory.
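For concreteness, here is a back-of-the-envelope sketch of that arithmetic. The field layout and the 24-byte string overhead are assumptions carried over from the discussion above, not measured values:

```java
// Back-of-the-envelope per-record size estimate, using the assumed layout
// from the discussion: two longs, an int, string overhead, a length field,
// a flag byte, and 2 bytes per UTF-16 character.
public class RecordSizeEstimate {
    static final int LONG_A = 8, LONG_B = 8, INT_FIELD = 4;
    static final int STRING_OVERHEAD = 24;  // assumed object overhead, the "24?" above
    static final int LENGTH_FIELD = 4, FLAG = 1;

    static int recordBytes(int chars) {
        return LONG_A + LONG_B + INT_FIELD + STRING_OVERHEAD
             + LENGTH_FIELD + FLAG + 2 * chars;
    }

    public static void main(String[] args) {
        int perRecord = recordBytes(15);        // 79 bytes for a 15-char string
        long total = 10_000_000L * perRecord;   // total for 10 million records
        System.out.println(perRecord + " bytes/record, " + total + " bytes total");
    }
}
```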
If you need to support random access to n small records, where n = 10 million, then you should design for at least 2n or even 10n. Perhaps you're already accounting for this in your 10 million? Either way, there are plenty of technologies that can support this type of data access.
One possibility: if the string has a maximum length (ml) of a reasonable size (say 255), you can use a simple ISAM store. Each record would be 8 + 8 + 4 + 255 bytes, and you can simply seek to record-index × record-size in a flat file to read it. If the record size is variable or potentially large, you'll want a different storage format and an index of offsets into the file.
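A minimal sketch of that fixed-width approach, assuming two longs, an int, and a string zero-padded to 255 ASCII bytes (the class and method names here are hypothetical, not from any particular library):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// ISAM-style fixed-width record store: each record is 8 + 8 + 4 + 255 = 275
// bytes, so record i lives at byte offset i * 275 in the flat file.
public class FixedRecordStore implements AutoCloseable {
    static final int MAX_STRING = 255;
    static final int RECORD_SIZE = 8 + 8 + 4 + MAX_STRING;

    private final RandomAccessFile file;

    public FixedRecordStore(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    public void write(long index, long a, long b, int c, String s) throws IOException {
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        if (str.length > MAX_STRING) throw new IllegalArgumentException("string too long");
        file.seek(index * RECORD_SIZE);
        file.writeLong(a);
        file.writeLong(b);
        file.writeInt(c);
        file.write(Arrays.copyOf(str, MAX_STRING)); // zero-padded to fixed width
    }

    // Returns {Long a, Long b, Integer c, String s} for the record at index.
    public Object[] read(long index) throws IOException {
        file.seek(index * RECORD_SIZE);
        long a = file.readLong(), b = file.readLong();
        int c = file.readInt();
        byte[] str = new byte[MAX_STRING];
        file.readFully(str);
        int len = 0;
        while (len < MAX_STRING && str[len] != 0) len++; // trim zero padding
        return new Object[]{ a, b, c, new String(str, 0, len, StandardCharsets.US_ASCII) };
    }

    @Override public void close() throws IOException { file.close(); }
}
```

The fixed record size is what makes the offset arithmetic trivial; the price is up to 255 bytes of padding per record for short strings.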
Another possibility: if you're looking up values by some key, then I would recommend something like an embedded database or a B-tree, one where you can disable some of the disk-consistency guarantees to gain performance. As it happens, I wrote a BPlusTree for client-side caching of large volumes of data. Detailed information on using the B+Tree is here.
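Conceptually, what such a keyed store gives you is an index from key to the record's offset and length in the data file. A toy in-memory stand-in (a byte array playing the role of the file, and a `TreeMap` playing the role of the on-disk B-tree pages):

```java
import java.io.ByteArrayOutputStream;
import java.util.TreeMap;

// Toy key -> record index over variable-length records appended to a "file".
// A real embedded store keeps both the records and the index itself on disk.
public class KeyedOffsetIndex {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final TreeMap<String, long[]> index = new TreeMap<>(); // key -> {offset, length}

    public void put(String key, byte[] record) {
        long offset = data.size();              // record starts at current end of data
        data.write(record, 0, record.length);   // append the record bytes
        index.put(key, new long[]{ offset, record.length });
    }

    public byte[] get(String key) {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] all = data.toByteArray();        // a real store would seek() on disk instead
        byte[] rec = new byte[(int) loc[1]];
        System.arraycopy(all, (int) loc[0], rec, 0, rec.length);
        return rec;
    }
}
```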