Fast disk-based hashtables?


I have sets of hashes (first 64 bits of MD5, so they're distributed very randomly) and I want to be able to see if a new hash is in a set, and to add it to a set.

6 answers

遥遥无期

2020-12-04 11:55

I had some trouble picturing your exact problem/need, but it still got me thinking about Git and how it stores SHA1-references on disk:

Take the hexadecimal string representation of a given hash, say, "abfab0da6f4ebc23cb15e04ff500ed54". Chop off the first two characters of the hash ("ab", in our case) and make them into a directory. Then use the rest ("fab0da6f4ebc23cb15e04ff500ed54") as the file name, create the file, and put stuff in it.

This way, you get pretty decent performance on-disk (depending on your FS, naturally) with an automatic indexing. Additionally, you get direct access to any known hash, just by wedging a directory delimiter after the two first chars ("./ab/fab0da[..]")
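
A minimal Python sketch of that layout, for a set-membership use case where the file's existence is the only information needed. The function names (`hash_path`, `add_hash`, `has_hash`) and the `root` directory are made up for illustration. This is not how Git itself is implemented.

```python
import os


def hash_path(root: str, hex_hash: str) -> str:
    """Map a hex digest to <root>/<first two chars>/<remaining chars>."""
    return os.path.join(root, hex_hash[:2], hex_hash[2:])


def add_hash(root: str, hex_hash: str) -> bool:
    """Create an empty marker file for the hash; return True if it was new."""
    path = hash_path(root, hex_hash)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        # O_EXCL makes the call fail if the file already exists,
        # so "check" and "add" happen in a single step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def has_hash(root: str, hex_hash: str) -> bool:
    """Membership test: does the marker file exist?"""
    return os.path.exists(hash_path(root, hex_hash))
```

With one file per hash, per-file overhead in the file system will likely dominate for very large sets, so this is probably best suited to moderate sizes.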

I'm sorry if I missed the ball entirely, but with any luck, this might give you an idea.

深忆病人

2020-12-04 12:02
Here's the solution I eventually used:
- One file per set
- File contains 2^k buckets, each 256 bytes, i.e. 32 entries of 8 bytes
- Empty entries are just zeroed out (000... is a valid hash, but I don't care about a 2^-64 chance of collision, since everything can already collide with everything else by the nature of hashing)
- Every hash resides in the bucket selected by its first k bits
- If any bucket overflows, double file size and split every bucket
- Everything is accessed via mmap(), not read()/write()
It's just unbelievably faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high-performance databases. It will not work with anything that's less uniformly distributed than MD5; it assumes everything will be extremely uniform to keep the implementation simple.

I tried it with seek()/sysread()/syswrite() at first, and it was very slow; the mmap() version is really a lot faster.
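
The original was written in Perl and is not shown here. The scheme described above could be sketched in Python roughly as follows. The class name `HashSetFile`, the helper `h64`, the big-endian entry encoding, and the default `k=4` are all assumptions made for illustration. As in the description, an all-zero entry means "empty", so a hash of exactly 0 cannot be told apart from an empty slot.

```python
import hashlib
import mmap
import struct

BUCKET_BYTES = 256              # 32 entries of 8 bytes each
ENTRY = struct.Struct(">Q")     # one 64-bit hash per entry (byte order is an assumption)


def h64(data: bytes) -> int:
    """First 64 bits of the MD5 digest, as an unsigned integer."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


class HashSetFile:
    def __init__(self, path, k=4):
        try:
            with open(path, "xb") as f:            # new file: 2**k zeroed buckets
                f.write(bytes(BUCKET_BYTES << k))
        except FileExistsError:
            pass                                   # existing file: k is derived from its size
        self.f = open(path, "r+b")
        self._map()

    def _map(self):
        self.mm = mmap.mmap(self.f.fileno(), 0)    # map the whole file
        self.k = (len(self.mm) // BUCKET_BYTES).bit_length() - 1

    def close(self):
        self.mm.close()
        self.f.close()

    def _slots(self, h):
        base = (h >> (64 - self.k)) * BUCKET_BYTES  # first k bits select the bucket
        for off in range(base, base + BUCKET_BYTES, ENTRY.size):
            yield off, ENTRY.unpack_from(self.mm, off)[0]

    def __contains__(self, h):
        return any(v == h for _, v in self._slots(h))

    def add(self, h):
        """Insert h; return True if it was not already present."""
        while True:
            for off, v in self._slots(h):
                if v == h:
                    return False
                if v == 0:                          # zero marks an empty entry
                    ENTRY.pack_into(self.mm, off, h)
                    return True
            self._grow()                            # bucket overflow: double and retry

    def _grow(self):
        old = 1 << self.k
        self.mm.close()
        self.f.truncate(2 * old * BUCKET_BYTES)     # double the file size
        self._map()                                 # self.k is now one larger
        # Split old bucket i into new buckets 2i and 2i+1. Going from high to
        # low means no bucket is overwritten before it has been read.
        for i in range(old - 1, -1, -1):
            src = i * BUCKET_BYTES
            chunk = self.mm[src:src + BUCKET_BYTES]
            entries = [v for (v,) in ENTRY.iter_unpack(chunk) if v]
            dst = 2 * src
            self.mm[dst:dst + 2 * BUCKET_BYTES] = bytes(2 * BUCKET_BYTES)
            for v in entries:
                self.add(v)                         # cannot overflow: at most 32 entries
```

A possible usage: `s = HashSetFile("set.bin")`, then `s.add(h64(b"some data"))` returns `True` the first time and `False` on a repeat, and `h64(b"some data") in s` tests membership. Like the described scheme, this sketch relies on the hashes being uniformly distributed. Heavily skewed input would trigger repeated doubling.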
攒了一身酷

2020-12-04 12:04

Sounds like a job for Berkeley DB.

粉色の甜心

2020-12-04 12:10

Other disk-based hashing algos/data structures include linear hashing and extensible hashing.

小鲜肉

2020-12-04 12:11

Since a hash lookup needs random access, I doubt any database will give you decent performance. Your best bet might be to increase the disk cache (more RAM) and get hard disks with very high random-access speed (maybe solid-state disks).

青春惊慌失措

2020-12-04 12:16
Two algorithms come to my mind at first:
- Use a b-tree.
- Separate-chain the hashes themselves by doing something like using the first 10 bits of your hash to index into one of 1024 individual files, each of which contains a sorted list of all the hashes starting with those 10 bits. That gives you a constant-time jump into a block that ought to fit into memory, and a log(n) search once you've loaded that block. (or you could use 8 bits to hash into 256 files, etc.)
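
The second idea could look something like this in Python. It is only a sketch: the file naming, the function names, and storing each shard as a native-endian array of 64-bit integers are assumptions. Rewriting the whole shard on every insert is a simplification chosen to keep the example short.

```python
import bisect
import os
from array import array

PREFIX_BITS = 10   # 2**10 = 1024 shard files


def shard_path(root: str, h: int) -> str:
    """The first PREFIX_BITS bits of the 64-bit hash select the shard file."""
    return os.path.join(root, f"{h >> (64 - PREFIX_BITS):04d}.bin")


def load_shard(path: str) -> array:
    """Load one shard as a sorted array of unsigned 64-bit integers."""
    a = array("Q")
    if os.path.exists(path):
        with open(path, "rb") as f:
            a.frombytes(f.read())
    return a


def contains(root: str, h: int) -> bool:
    a = load_shard(shard_path(root, h))
    i = bisect.bisect_left(a, h)          # binary search within the shard
    return i < len(a) and a[i] == h


def add(root: str, h: int) -> bool:
    """Insert h in sorted position; return True if it was new."""
    path = shard_path(root, h)
    a = load_shard(path)
    i = bisect.bisect_left(a, h)
    if i < len(a) and a[i] == h:
        return False
    a.insert(i, h)
    with open(path, "wb") as f:           # simplification: rewrite the whole shard
        a.tofile(f)
    return True
```

Setting `PREFIX_BITS = 8` gives the 256-file variant mentioned above.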