A space-efficient data structure to store and look up values in a large set of (uniformly distributed) integers

悲哀的现实 2021-01-06 22:28

I'm required to hold in memory, and look up values among, one million uniformly distributed integers. My workload is extremely lookup-intensive.
My current implementation u

7 Answers
  • 2021-01-06 22:44

    Sounds like you could just keep a sorted int[] and then do a binary search. With a million values, that's ~20 comparisons to get to any value - would that be fast enough?
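
A minimal sketch of the sorted-array idea, assuming the values fit in a plain `int[]` and the set is built once and then only queried. The class name `SortedIntSet` is made up for illustration; `Arrays.sort` and `Arrays.binarySearch` are standard JDK calls.

```java
import java.util.Arrays;

/** Membership set backed by a sorted int[]: 4 bytes per value, no per-element objects. */
final class SortedIntSet {
    private final int[] sorted;

    SortedIntSet(int[] values) {
        sorted = values.clone();   // defensive copy so the caller's array is left untouched
        Arrays.sort(sorted);
    }

    boolean contains(int key) {
        // Arrays.binarySearch returns a non-negative index only when the key is present.
        return Arrays.binarySearch(sorted, key) >= 0;
    }
}
```

For one million values this holds roughly 4 MB of payload, and each lookup takes about log2(1,000,000) ≈ 20 comparisons.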

  • 2021-01-06 22:53

    While Jon Skeet's answer gives good savings for a small investment, I think you can do better. Since your numbers are fairly evenly distributed, you can use an interpolation search for faster lookups (roughly O(log log N) instead of O(log N)). For a million items, you can probably plan on around 4 comparisons instead of around 20.
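
Here is a rough sketch of an interpolation search over an ascending-sorted `int[]`. The class and method names are hypothetical, and the O(log log N) behaviour only holds when the values really are close to uniformly distributed; on skewed data it can degrade towards a linear scan.

```java
final class InterpolationSearch {
    /** Returns true if key is present in the ascending-sorted array a. */
    static boolean contains(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi]) {
                return a[lo] == key;   // all remaining values are equal; avoids dividing by zero
            }
            // Guess the position by assuming values are spread evenly between a[lo] and a[hi].
            // long arithmetic avoids overflow when the values span most of the int range.
            long range = (long) a[hi] - a[lo];
            long offset = (long) key - a[lo];
            int pos = lo + (int) (offset * (hi - lo) / range);
            if (a[pos] == key) {
                return true;
            }
            if (a[pos] < key) {
                lo = pos + 1;
            } else {
                hi = pos - 1;
            }
        }
        return false;
    }
}
```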

    If you want to do just a little more work to cut the memory (roughly) in half again, you could build it as a two-level lookup table, basically a sort of simple version of a trie.

    You'd break your (presumably) 32-bit integer into two 16-bit pieces. You'd use the first 16 bits as an index into the first level of the lookup table. At this level, you'd have 65536 pointers, one for each possible 16-bit value for that part of your integer. That would take you to the second level of the table. For this part, we'd do a binary or interpolation search between the chosen pointer, and the next one up -- i.e., all the values in the second level that had that same value in the first 16 bits.

    When we look in the second table, however, we already know 16 bits of the value -- so instead of storing all 32 bits of the value, we only have to store the other 16 bits of the value.

    That means instead of the second level occupying 4 megabytes, we've reduced it to 2 megabytes. Along with that we need the first level table, but it's only 65536 × 4 bytes = 256 KB.

    This will almost certainly improve speed over a binary search of the entire data set. In the worst case (using a binary search for the second level) we could have as many as 17 comparisons (1 + log2 65536). The average will be better than that though -- since we have only a million items, there can only be an average of 1_000_000/65536 = ~15 items in each second-level "partition", giving approximately 1 + log2(16) = 5 comparisons. Using an interpolating search at the second level might reduce that a little further, but when you're only starting with 5 comparisons, you don't have much room left for really dramatic improvements. Given an average of only ~15 items at the second level, the type of search you use won't make much difference -- even a linear search is going to be pretty fast.
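
The two-level table described above could be sketched roughly like this. It is only an illustration under a few assumptions: the "pointers" of the first level are stored as `int` offsets into one shared `short[]`, the second level uses a plain binary search, and the name `TwoLevelIntSet` is invented for the example.

```java
import java.util.Arrays;

/** Two-level lookup: the high 16 bits select a bucket, the low 16 bits are stored as shorts. */
final class TwoLevelIntSet {
    // Bucket h occupies low[bucketStart[h]] .. low[bucketStart[h + 1] - 1].
    private final int[] bucketStart = new int[65537];
    private final short[] low;

    TwoLevelIntSet(int[] values) {
        int[] sorted = values.clone();
        // Flipping the sign bit makes signed sorting produce unsigned order, so values
        // with the same high 16 bits become adjacent and ordered by their low 16 bits.
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] ^= Integer.MIN_VALUE;
        }
        Arrays.sort(sorted);
        low = new short[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            int v = sorted[i] ^ Integer.MIN_VALUE;   // restore the original bits
            bucketStart[(v >>> 16) + 1]++;           // count per bucket, shifted by one slot
            low[i] = (short) v;                      // keep only the low 16 bits
        }
        for (int h = 0; h < 65536; h++) {
            bucketStart[h + 1] += bucketStart[h];    // prefix sums turn counts into offsets
        }
    }

    boolean contains(int key) {
        int h = key >>> 16;
        int lo = bucketStart[h], hi = bucketStart[h + 1] - 1;
        int target = key & 0xFFFF;
        while (lo <= hi) {                           // binary search inside one bucket
            int mid = (lo + hi) >>> 1;
            int v = low[mid] & 0xFFFF;               // read the short as unsigned
            if (v == target) {
                return true;
            }
            if (v < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }
}
```

With one million values this keeps about 2 MB in `low` plus roughly 256 KB in `bucketStart`, in line with the estimate above. Note that construction temporarily needs an extra copy of the input.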

    Of course, if you wanted to, you could go a step further and use a 4-level table instead (one level for each byte in the integer). It may be open to question, however, whether that would save you enough more to be worth the trouble. At least right off, my immediate guess is that you'd be doing a fair amount of extra work for fairly minimal savings: just storing the final bytes of the million integers obviously occupies 1 megabyte, and three levels of table leading to that would clearly occupy a fair amount more, so you'd double the number of levels to save something like half a megabyte. If you're in a situation where saving just a little more would make a big difference, go for it -- but otherwise, I doubt whether the return will justify the extra investment.

  • 2021-01-06 22:55

    If you are willing to accept a small chance of a false positive in return for a large reduction in memory usage, then a Bloom filter may be just what you need.

    A Bloom filter consists of k hash functions and a table of n bits, initially empty. To add an item to the table, feed it to each of the k hash functions (getting a number between 0 and n−1) and set the corresponding bit. To check if an item is in the table, feed it to each of the k hash functions and see if all corresponding k bits are set.

    A Bloom filter with a 1% false positive rate requires about 10 bits per item; the false positive rate decreases rapidly as you add more bits per item.
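
To make the add/check steps concrete, here is a small sketch of a Bloom filter for `int` keys. It is not the implementation linked in this answer. As a simplification it derives the k bit positions from two mixed hashes (double hashing) rather than k truly independent hash functions, and the mixing function, class name, and parameters are all assumptions chosen for illustration.

```java
import java.util.BitSet;

/** Bloom filter for ints: no false negatives, a tunable rate of false positives. */
final class IntBloomFilter {
    private final BitSet bits;
    private final int n;   // number of bits in the table
    private final int k;   // number of bit positions probed per key

    IntBloomFilter(int n, int k) {
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    // Integer mixer (the finalizer used by MurmurHash3) to spread out nearby keys.
    private static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    // i-th bit position for key, derived from two hashes instead of k separate functions.
    private int index(int key, int i) {
        int h1 = mix(key);
        int h2 = mix(key ^ 0x9E3779B9) | 1;
        return Math.floorMod(h1 + i * h2, n);
    }

    void add(int key) {
        for (int i = 0; i < k; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means definitely absent; true means probably present. */
    boolean mightContain(int key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

For one million items at about 10 bits each, you would construct it with something like `new IntBloomFilter(10_000_000, 7)`, which is roughly 1.25 MB; k = 7 is the usual choice near that ratio.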

    Here's an open-source implementation in Java.

  • 2021-01-06 22:59

    There are some Java implementations of Sets for Integers with reduced memory consumption in the GitHub project LargeIntegerSet.

  • 2021-01-06 23:02

    You might want to take a look at a BitSet. The one used in Lucene is even faster than the standard Java implementation since it omits some of the standard bounds checks.
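
A small sketch using the standard `java.util.BitSet` (not the Lucene one), assuming all values are non-negative and fall in a known, reasonably small range. The wrapper class `BitSetIntSet` and the `maxValue` parameter are made up for the example.

```java
import java.util.BitSet;

/** Membership set using one bit per possible value in [0, maxValue]. */
final class BitSetIntSet {
    private final BitSet bits;

    BitSetIntSet(int[] values, int maxValue) {
        bits = new BitSet(maxValue + 1);   // size hint; BitSet grows if a larger index is set
        for (int v : values) {
            bits.set(v);                   // throws IndexOutOfBoundsException if v is negative
        }
    }

    boolean contains(int key) {
        // BitSet.get rejects negative indices, so screen them out first.
        return key >= 0 && bits.get(key);
    }
}
```

Lookups are a single array access, but memory depends on the value range rather than the number of items: covering every non-negative `int` takes about 256 MB, so this only pays off when the range is not much larger than the set.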

  • 2021-01-06 23:02

    There are some IntHashSet implementations for primitives available.

    Quick googling got me this one. There is also an Apache [open source] implementation of IntHashSet. I'd prefer the Apache implementation, though it has some overhead [it is implemented as an IntToIntMap].
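
To show what a primitive hash set looks like inside, here is a bare-bones sketch using open addressing with linear probing. It is not the code of either library mentioned above: the class name, the fixed capacity without resizing, and the use of 0 as the empty-slot marker are all simplifying assumptions.

```java
/** Fixed-capacity hash set of ints stored directly in an int[] (no boxing). */
final class SimpleIntHashSet {
    private final int[] table;   // 0 marks an empty slot, so key 0 is tracked separately
    private final int mask;
    private boolean hasZero;
    private int size;

    SimpleIntHashSet(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 29)) {
            throw new IllegalArgumentException("unsupported size: " + expectedSize);
        }
        int cap = 16;
        while (cap < 2L * expectedSize) {
            cap <<= 1;           // power-of-two capacity, at most half full
        }
        table = new int[cap];
        mask = cap - 1;
    }

    private static int mix(int x) {
        int h = x * 0x9E3779B9;  // multiplicative hash to scatter nearby keys
        return h ^ (h >>> 16);
    }

    void add(int key) {
        if (key == 0) {
            hasZero = true;
            return;
        }
        int i = mix(key) & mask;
        while (table[i] != 0) {
            if (table[i] == key) {
                return;          // already present
            }
            i = (i + 1) & mask;  // linear probing
        }
        if (size >= table.length / 2) {
            throw new IllegalStateException("set is full; this sketch does not resize");
        }
        table[i] = key;
        size++;
    }

    boolean contains(int key) {
        if (key == 0) {
            return hasZero;
        }
        int i = mix(key) & mask;
        while (table[i] != 0) {
            if (table[i] == key) {
                return true;
            }
            i = (i + 1) & mask;
        }
        return false;
    }
}
```

The trade-off is memory for speed: lookups are expected constant time, but at a load factor of at most 0.5 the table needs at least 8 bytes per stored value.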
