A space-efficient data structure to store and look up a large set of (uniformly distributed) integers

悲哀的现实 · 2021-01-06 22:28

I'm required to hold, in memory, one million uniformly distributed integers and look them up. My workload is extremely look-up intensive.
My current implementation uses a sorted int array with binary-search look-ups.

7 Answers
  •  挽巷 · 2021-01-06 22:53

    While Jon Skeet's answer gives good savings for a small investment, I think you can do better. Since your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log N) instead of O(log N)). For a million items, you can probably plan on around 4 comparisons instead of around 20.
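    To make this concrete, here's a minimal Java sketch of an interpolation search over a sorted int[] (the method name and signature are illustrative, not from the question):

    ```java
    // Minimal sketch: interpolation search over a sorted int[].
    // Instead of always probing the middle, we probe where the key "should" be
    // based on its fraction of the value range -- which is what makes the
    // expected cost roughly O(log log N) for uniformly distributed keys.
    static int interpolationSearch(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi]) {                 // one distinct value left; avoid /0
                return a[lo] == key ? lo : -1;
            }
            // Estimate the index from the key's position within the value range.
            long range = (long) a[hi] - a[lo];
            int mid = lo + (int) (((long) key - a[lo]) * (hi - lo) / range);
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;                                // not present
    }
    ```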

    If you want to do just a little more work to cut the memory (roughly) in half again, you could build it as a two-level lookup table -- essentially a simplified trie (a code sketch follows below).


    You'd break your (presumably) 32-bit integers into two 16-bit pieces. You'd use the first 16 bits as an index into the first level of the lookup table. At this level, you'd have 65536 pointers, one for each possible value of the top 16 bits of an integer. That takes you to the second level of the table. There, we'd do a binary or interpolation search between the chosen pointer and the next one up -- i.e., over all the values in the second level that share the same first 16 bits.

    When we look in the second table, however, we already know 16 bits of the value -- so instead of storing all 32 bits of the value, we only have to store the other 16 bits of the value.
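    Here's a hedged Java sketch of that layout (the class and method names are my own, not from the answer): the first level is an array of 65537 offsets, where start[h] .. start[h+1] bounds the bucket of keys whose high 16 bits equal h, and the second level stores only the low 16 bits of each key as a short.

    ```java
    /** Sketch of the two-level table: the high 16 bits of a key select a bucket
     *  directly, and only the low 16 bits are stored (as shorts) in that bucket. */
    final class TwoLevelSet {
        private final int[] start = new int[65537]; // start[h]..start[h+1] bounds bucket h
        private final short[] low;                  // low 16 bits of every key

        /** Build from keys sorted in ascending order. */
        TwoLevelSet(int[] sortedKeys) {
            low = new short[sortedKeys.length];
            for (int k : sortedKeys) start[(k >>> 16) + 1]++;          // bucket sizes
            for (int h = 0; h < 65536; h++) start[h + 1] += start[h];  // prefix sums
            int[] next = start.clone();
            for (int k : sortedKeys) low[next[k >>> 16]++] = (short) k; // stable placement
        }

        boolean contains(int key) {
            int lo = start[key >>> 16], hi = start[(key >>> 16) + 1] - 1;
            int target = key & 0xFFFF;          // compare low halves as unsigned
            while (lo <= hi) {                  // plain binary search inside the bucket
                int mid = (lo + hi) >>> 1;
                int v = low[mid] & 0xFFFF;
                if (v == target) return true;
                if (v < target) lo = mid + 1; else hi = mid - 1;
            }
            return false;
        }
    }
    ```

    For a million keys this works out to exactly the 2 MB short array plus the 256 KB offset array described next.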

    That means instead of the second level occupying 4 megabytes, we've reduced it to 2 megabytes. Along with that we need the first-level table, but it's only 65536 × 4 bytes = 256 KB.

    This will almost certainly improve speed over a binary search of the entire data set. In the worst case (using a binary search for the second level) we could have as many as 17 comparisons (1 + log2 65536). The average will be better than that, though -- since we have only a million items, each second-level "partition" holds an average of 1,000,000 / 65536 ≈ 15 items, giving approximately 1 + log2(16) = 5 comparisons. Using an interpolation search at the second level might reduce that a little further, but when you're starting with only 5 comparisons, there isn't much room left for really dramatic improvement. Given an average of only ~15 items at the second level, the type of search you use won't make much difference -- even a linear search is going to be pretty fast.
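    As a quick usage sketch (again hypothetical, building on the TwoLevelSet class above):

    ```java
    import java.util.Arrays;
    import java.util.Random;

    public class Demo {
        public static void main(String[] args) {
            // A million roughly uniformly distributed keys, as in the question.
            Random rnd = new Random(42);
            int[] keys = new int[1_000_000];
            for (int i = 0; i < keys.length; i++) keys[i] = rnd.nextInt();
            Arrays.sort(keys);

            TwoLevelSet set = new TwoLevelSet(keys);
            // Average bucket holds 1,000,000 / 65536 ≈ 15 entries, so the in-bucket
            // search needs only ~1 + log2(16) = 5 comparisons, as estimated above.
            System.out.println(set.contains(keys[123_456])); // always true
            System.out.println(set.contains(42));            // true only if 42 was drawn
        }
    }
    ```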

    Of course, if you wanted to, you could go a step further and use a four-level table instead (one level for each byte of the integer). It's open to question, however, whether that would save enough to be worth the trouble. At least right off, my immediate guess is that you'd be doing a fair amount of extra work for fairly minimal savings: just storing the final bytes of the million integers obviously occupies 1 megabyte, and the three levels of table leading to that would clearly occupy a fair amount more, so you'd double the number of levels to save something like half a megabyte. If you're in a situation where saving just a little more would make a big difference, go for it -- but otherwise, I doubt the return will justify the extra investment.
