In light of this article, I am wondering what people\'s experiences are with storing massive datasets (say, >10,000,000 objects) in-memory using arrays to store data fields inst
[update 2011-07-19]
there's a new version available now: http://www.mediafire.com/file/74fxj7u1n0ppcq9/MemStorageDemo-6639584-2011_07_19-12_47_00.zip
i am still trying to debug some of the refcounting which is annoying, but from a fresh xUnit session, i am able to run a test that creates 10 million objects (it occurrs to me i have the string size cut down right now for testing but i had it running 10 million with strings of variable length from 3 to 15 bytes, i haven't had a chance to try bigger than that yet. on my system i was going from approx ~1.95G load to ~2.35G with the 10 million objects and i still haven't done anything about the strings other than a very simple backing class that uses actual managed strings.
anyway, so, i think it's fairly well working, although there's definitely optimizing left to be done on the backing store, and i also think a bit of work can be done on the iterators if necessary, depending on how much data you process at once. unfortunately just not going to be able to look at it again until probably late tomorrow or next day.
anyway, so here's the basic ideas:
MemoryArray
class: uses unmanaged memory allocated by Marhsal.AllocHGlobal()
to store the structures. i was testing with one big MemoryArray
but member-wise there are only a few fields and i think as long as you kept the array size fairly large it's not going to be much of a difference in memory consumption to split it up. MemoryArray
implements IEnumerable
which is what i've been using to fill the arrays in testing.MemoryArray
is intended for holding the regularly-sized chunks of data from any object being backed. there are some things you can do with the enumerators using pointer math that i have not yet implemented. i am currently returning new objects every time, so that's a big chunk of the size i believe. in the prototype class i based this prototype on, i was able to use very regular pointer math for doing traversals, however i think the way i was doing it would be mainly useful for very quick traversals and probably not for interoperability.
MemoryArray
has just a standard indexer that grabs the requested element using pointer math on the Int
Ptr that represents the head of the unmanaged data which is allocated upon construction. I have also implemented a notional 2D indexer, where you can pass in an array of ints representing dimensions to the Array, and then can perform a A[x,y] on it. it's just a quick little example of how it could work.
one thing i haven't implemented yet is any sort of subsectioning, but i do think a subsection is appropriate for the project so i will probably implement that when i get a chance.
MemoryArrayEnumerator
class: i chose to implement actual enumerators instead of enumerator functions. the enumerator class basically just takes a MemoryArray
and then provides the stock functions for enumerators to return the actual MemoryArrayItem
objects.
MemoryArrayItem
class: this class doesn't do much other than just hold the appropraite pointer information based off the start and position in the array. it's the specific classes that get implemented on top of (to the side of?) this object that actually do the pointer stuff to get the data out.
then there are a few more support classes, MemoryStringArray
is the variable-sized memory backing chunk that only does strings for the time being, and then there's the auto disposing class (AutoDisposer
) and a generic class to handle the attaches and detaches. (AutoReference<>
).
now, on top of this foundation are the specific classes. each of the three types (array/enumerator/item) are implemented specifically for the objects you're looking at. in the more massive version of this project that i gave up on and that this is a spike for, i have more generic handling of the offsets and such so that you're not so tied to concrete classes, but even as they are they are pretty useful; my original implementation was all concrete classes with no real base.
currently i have them all implemented as separate classes that you pass references to; so, TestArray
class is passed a MemoryArray
class in its constructor. same with enumerator and item. i know there would be some processing power to be saved there, and i think there's a good possibility for space savings as well, if i can figure out a good way to implement them as descendants rather than just having a copy of the underlying classes. i wanted to get the basic feel for it first though, and that seemed to be the most straightforward way to go. the issue is that it's another layer of indirection.
TestArray
and TestArrayEnumerator
turned out not to do too much other than just pass through the functionality of MemoryArray
and MemoryArrayEnumerator
. the main issue in those classes is just getting the pointers carved up and passed out and into the items that are using it.
but, so TestArrayItem is where the pointers are actually switched into real data; here's the file. i snipped a big section of comments that goes through some options for a better way to handle the variable-length backing store (still in the actual file in the link given above), and please excuse the multitude of comments where i leave myself notes about what i'm thinking as i work on it :)
TestArrayItem.cs
// ----------------------------------------------
// rights lavished upon all with love
// see 'license/unlicense.txt'
// ♥ 2011, shelley butterfly - public domain
// ----------------------------------------------
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace MemStorageDemo
{
public unsafe class TestArrayItem
{
public static MemoryStringArray s_MemoryArrayStore = new MemoryStringArray();
public static int TestArrayItemSize_bytes =
sizeof(double) * 2
+ sizeof(int)
+ sizeof(int);
// hard-coding here; this is another place that things could be a little more generic if you wanted and if
// performance permitted; for instance, creating a dictionary of offsets based on index. also, perhaps
// a dictionary with key strings to allow indexing using field names of the object.
private enum EFieldOffset
{
DoubleTheFirstOffset = 0,
DoubleTheSecondOffset = 8,
IntTheFirstOffset = 16,
StringTheFirstHandleOffset = 20
}
private MemoryArrayItem myMemoryArrayItem;
private MemoryStringArray myStringStore;
// constructor that uses the static string array store
public TestArrayItem(MemoryArrayItem parMemoryArrayItem) :
this(parMemoryArrayItem, s_MemoryArrayStore)
{
}
// constructor for getting the item at its memory block without any initialization (e.g. existing item)
public TestArrayItem(MemoryArrayItem parMemoryArrayItem, MemoryStringArray parStringStore)
{
myMemoryArrayItem = parMemoryArrayItem;
myStringStore = parStringStore;
}
// constructor for geting the item at its memory block and initializing it (e.g. adding new items)
public TestArrayItem(MemoryArrayItem parMemoryArrayItem, double parDoubleTheFirst, double parDoubleTheSecond, int parIntTheFirst, string parStringTheFirst)
{
myMemoryArrayItem = parMemoryArrayItem;
DoubleTheFirst = parDoubleTheFirst;
DoubleTheSecond = parDoubleTheSecond;
IntTheFirst = parIntTheFirst;
StringTheFirst = parStringTheFirst;
}
// if you end up in a situation where the compiler isn't giving you equivalent performance to just doing
// the array math directly in the properties, you could always just do the math directly in the properties.
//
// it reads much cleaner the way i have it set up, and there's a lot less code duplication, so without
// actually determining empirically that i needed to do so, i would stick with the function calls.
private IntPtr GetPointerAtOffset(EFieldOffset parFieldOffset)
{ return myMemoryArrayItem.ObjectPointer + (int)parFieldOffset; }
private double* DoubleTheFirstPtr
{ get { return (double*)GetPointerAtOffset(EFieldOffset.DoubleTheFirstOffset); } }
public double DoubleTheFirst
{
get
{
return *DoubleTheFirstPtr;
}
set
{
*DoubleTheFirstPtr = value;
}
}
private double* DoubleTheSecondPtr
{ get { return (double*)GetPointerAtOffset(EFieldOffset.DoubleTheSecondOffset); } }
public double DoubleTheSecond
{
get
{
return *DoubleTheSecondPtr;
}
set
{
*DoubleTheSecondPtr = value;
}
}
// ahh wishing for a preprocessor about now
private int* IntTheFirstPtr
{ get { return (int*)GetPointerAtOffset(EFieldOffset.IntTheFirstOffset); } }
public int IntTheFirst
{
get
{
return *IntTheFirstPtr;
}
set
{
*IntTheFirstPtr = value;
}
}
// okay since we're using the StringArray backing store in the example, we just need to get the
// pointer stored in our blocks, and then copy the data from that address
private int* StringTheFirstHandlePtr
{ get { return (int*)GetPointerAtOffset(EFieldOffset.StringTheFirstHandleOffset); } }
public string StringTheFirst
{
get
{
return myStringStore.GetString(*StringTheFirstHandlePtr);
}
set
{
myStringStore.ModifyString(*StringTheFirstHandlePtr, value);
}
}
public void CreateStringTheFirst(string WithValue)
{
*StringTheFirstHandlePtr = myStringStore.AddString(WithValue);
}
public override string ToString()
{
return string.Format("{0:X8}: {{ {1:0.000}, {2:0.000}, {3}, {4} }} {5:X8}", (int)DoubleTheFirstPtr, DoubleTheFirst, DoubleTheSecond, IntTheFirst, StringTheFirst, (int)myMemoryArrayItem.ObjectPointer);
}
}
}
so, that's the real magic; just basically implementing functions that figure out the right pointers based on the info about the fields. as it stands i think it's a pretty good candidate for code gen, well, assuming i get the attach/detach stuff working right. i hid a lot of having to do manual memory management using auto-type pointers and i think it will be worth the debug in the long run...
anyway, that's about it, i'll hopefully be back with internet this evening or tomorrow, and i'll do my best to check in. unfortunately about to go give the cable modem back to the cable company so won't be in business unless we go to a mcdonalds or something :) hope this has been some sort of help; i'll debug the issues with it so that there's at least a functional base to work from; i know this isn't the first time i've thought about writing a library like this and i imagine others have too.
i have used something like this for using a COM library that i created to interop with the pre-compiled win32/64 FFTW dll. for this question we really need something a little more generic than what i had, so i started working on something last week that would be sufficient as a decent generic library for these types of uses, with extensible memory management, multi-dimensions, slicing, etc.
well, i finally admimtted to myself yesterday that (a) it was going to be another few days before it was ready and (b) i needed a spike solution to figure out some of the last bits anyway. so, i decided to just do a first cut at a lower level of abstraction that attempts to meed the needs addressed in your question. unfortunately we are about to move and i have to do some packing but i think i'ts probably enough to address your question.
i am going to continue work on the example this week and i will update here with new code and i will attempt to distill enough info from it to make a post that explains it, but if you're still interested, here's the last i'll probably be able to post until later in the week:
http://www.mediafire.com/file/a7yq53ls18q7bvf/EfficientStorage-6639584.zip
it's just a VS2010 solution with the basic necessities for doing what i think meets your needs. there are lots of comments sprinkled through, but feel free to ask questions and i will check back as soon as i have internet... either way it's been a fun project.
after completing the example i intend to finish off the first iteration full library and release it somewhere; i will update here with the link when i have it.
obligatory warning: it compiles but has not been tested; i'm sure there are issues, it's a pretty big topic.