Java: fast disk-based hash set


Question


I need to store a big hash set, able to contain up to approximately 200 million 40-bit values. Storing them as 200 million 64-bit values would be acceptable (despite the 200 million × 24-bit loss).

The requirements are:

  • tiny memory footprint (disk space ain't an issue, memory is)

  • fast contains(long l) and add(long l) methods (much faster than SQL)

  • embedded

  • free and without nasty licensing (no Berkeley DB). LGPL fine.

  • no false positives and no false negatives, so things like disk-based Bloom filters are not what I'm after

SQL is not what I'm after here, because I really think I'm after something fast like this (notice how that solution is much faster than a SQL one):

Fast disk-based hashtables?

Does Google have such a Java API?

Would a fast disk-based key/value pair implementation where I'd only use the 'key' work?

Or something else?

I'd rather not reinvent the wheel.


Answer 1:


If you can afford 128 GB of disk, you could store one bit per 40-bit value (2^40 bits = 2^37 bytes = 128 GiB). You could then use a random access file to check whether a bit is set, or to set it. You wouldn't have to insert any values or maintain an index.
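A minimal sketch of that idea, assuming one pre-sized file where bit number v records membership of value v (class and path names are mine):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Disk-backed bit set for 40-bit values: 2^40 bits = 2^37 bytes = 128 GiB.
    public class DiskBitSet implements AutoCloseable {
        private final RandomAccessFile file;

        public DiskBitSet(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
            file.setLength(1L << 37); // pre-size to 2^40 bits of zeros
        }

        public void add(long value) throws IOException {
            long byteOffset = value >>> 3;     // which byte holds this value's bit
            int mask = 1 << (int) (value & 7); // which bit within that byte
            file.seek(byteOffset);
            int b = file.read();
            file.seek(byteOffset);
            file.write(b | mask);
        }

        public boolean contains(long value) throws IOException {
            file.seek(value >>> 3);
            return (file.read() & (1 << (int) (value & 7))) != 0;
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

On filesystems with sparse-file support, the setLength call does not physically allocate 128 GiB up front; blocks only get written as bits are set.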




Answer 2:


Try sg-cdb (the strange gizmo port of djb's cdb), and then swap out the RandomAccessFile for a BufferedRandomAccessFile (there's a good one in the jai-imageio code).

I'm getting mind-blowing write performance, through the roof, because it's all buffered and then committed in one fell swoop. Read performance with the buffered RAF is unchanged, though.

I could take the time to compare it with H2 bulk import; they might be comparable, though I'm not sure.

The database is simple: key => value, with lookup supported only on the key. In my case the keys are (base32-encoded random int + base32-encoded unique int), so locality shouldn't really help much. The values are arrays of 120 random bytes.
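To make the set semantics concrete for the original question, here's a hedged sketch of how I'd wrap sg-cdb; the CdbMake.start/add/finish and Cdb.find calls are my reading of the sg-cdb API and may differ slightly in your version:

    import java.nio.ByteBuffer;
    import com.strangegizmo.cdb.Cdb;
    import com.strangegizmo.cdb.CdbMake;

    // A long set on top of cdb: the long's 8 bytes are the key, the value is empty.
    // cdb files are write-once, so all adds happen in one build phase.
    public class CdbLongSet {
        private static byte[] keyOf(long l) {
            return ByteBuffer.allocate(8).putLong(l).array();
        }

        public static void build(String path, long[] values) throws Exception {
            CdbMake make = new CdbMake();
            make.start(path);
            for (long l : values) {
                make.add(keyOf(l), new byte[0]); // presence of the key is all we store
            }
            make.finish();
        }

        // Query against an open database, e.g. Cdb cdb = new Cdb(path);
        public static boolean contains(Cdb cdb, long l) {
            return cdb.find(keyOf(l)) != null; // null means "not in the set"
        }
    }

That write-once build phase is exactly why the buffered writes pay off: everything is added up front and committed once.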

loads (SQL insert)

H2, with a 131 MB cache (timings include flush, not startup):

  • inserts performed in: 101625 ms (logged May 4, 2011 11:45:10 PM by test.TestH2Simple main)

  • database size: about 140 MB

  • batch size 2000: inserts performed in: 116875 ms

  • batch size 10000: inserts performed in: 70234 ms

sg-cdb (strange gizmo port of cdb), for comparison:

  • with RandomAccessFile: insert without flush: 21688 ms, flush: 30359 ms, total: 52047 ms; size of file on disk: 66.1 MB (69,315,632 bytes)

  • with BufferedRandomAccessFile: about 6.5 seconds

Of course, it's not really a fair comparison, since H2 flushes data continuously during the insert, whereas sg-cdb does not. This has to be borne in mind when making the comparison. A fairer test would probably compare the sg-cdb insert with an H2 bulk insert.

reads (SQL select)

  • sg-cdb: searched: 490000, finished on: 1304544550439, took: 17547 ms, counter: 0

  • H2: selects performed in: 19579 ms

Concerning memory-mapped files: they don't seem to be what you're looking for. The big performance win with memory-mapped files comes when you map about 100 MB or more into memory (in my experience).
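For contrast, a memory-mapped take on answer 1's bit file might look like the sketch below. Note that a single MappedByteBuffer covers at most 2 GB, so a full 2^40-bit file would need an array of windows (all names here are mine):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // One 2 GB-max window into the on-disk bit set; a full 128 GiB file
    // would be covered by an array of these.
    public class MappedBitWindow {
        private final MappedByteBuffer buf;
        private final long baseBit; // first bit covered by this window

        public MappedBitWindow(String path, long baseByte, int sizeBytes) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel ch = raf.getChannel()) {
                // The mapping stays valid after the channel is closed.
                buf = ch.map(FileChannel.MapMode.READ_WRITE, baseByte, sizeBytes);
            }
            this.baseBit = baseByte << 3;
        }

        // Caller must ensure 'value' falls inside this window's range.
        public boolean contains(long value) {
            long bit = value - baseBit;
            return (buf.get((int) (bit >>> 3)) & (1 << (int) (bit & 7))) != 0;
        }
    }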




Answer 3:


I believe you will need to use a B-tree rather than a hash table. Hash tables do not have good locality for secondary storage, so you will lose too much time to disk I/O.

Most databases -- relational or not -- implement their indexes as a B-tree, so you are talking about the equivalent of storing an index with no other data attached to it.
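As one concrete embedded option (my illustration, not part of the original answer): H2's MVStore engine persists its maps as B-trees on disk, so a long set could be sketched as:

    import org.h2.mvstore.MVMap;
    import org.h2.mvstore.MVStore;

    // Disk-backed set of longs on top of a B-tree map; the map's Boolean
    // values are dummies, since only key membership matters.
    public class BTreeLongSet implements AutoCloseable {
        private final MVStore store;
        private final MVMap<Long, Boolean> map;

        public BTreeLongSet(String path) {
            store = MVStore.open(path);   // single-file, embedded store
            map = store.openMap("values");
        }

        public void add(long l) {
            map.put(l, Boolean.TRUE);
        }

        public boolean contains(long l) {
            return map.containsKey(l);
        }

        @Override
        public void close() {
            store.close();
        }
    }

A sorted B-tree map also keeps nearby keys on the same pages, which is the locality argument this answer makes against hash tables.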

Will you have multiple processes concurrently updating this data store?



Source: https://stackoverflow.com/questions/2346722/java-fast-disk-based-hash-set
