Question
I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-Tree versus Hash table. A hash table provides O(1) lookup, but individual inserts can be expensive (with linear/extensible hashing we get amortized O(1) inserts). B-Trees provide O(log_B N) lookup and insert times, and a B-Tree can also support range queries and allow access in sorted order.
- Apart from these considerations what else should be factored in?
- If I don't need to support range queries can I just use a Hashtable access method?
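(For reference, the access method is just a parameter chosen when the database is opened. A minimal sketch with the Perl BerkeleyDB module — the file names here are arbitrary:)

#!/usr/bin/perl
use warnings;
use strict;
use BerkeleyDB;

# B-Tree access method: keys kept in sorted order, range scans possible via cursors.
my $btree = BerkeleyDB::Btree->new(-Filename => "btree.db", -Flags => DB_CREATE)
    or die $BerkeleyDB::Error;

# Hash access method: point lookups only, no key ordering.
my $hash = BerkeleyDB::Hash->new(-Filename => "hash.db", -Flags => DB_CREATE)
    or die $BerkeleyDB::Error;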
Answer 1:
When your data sets get very large, B-trees are still better because the majority of the internal metadata may still fit in cache. Hashes, by their nature (uniform random distribution of data), are inherently cache-unfriendly. That is, once the total size of the data set exceeds the working memory size, hash performance drops off a cliff while B-tree performance degrades gracefully (logarithmically, actually).
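As a rough illustration (ballpark numbers, not taken from the answer above): with 4 KB pages holding on the order of 100 keys each, a B-tree over 16 million keys needs about 160,000 leaf pages but only a couple of thousand interior pages, so the interior levels occupy just a few megabytes and can stay cached even when the leaves and data no longer fit in memory. A hash table spreads accesses uniformly over the whole file, so once the file outgrows memory nearly every lookup or insert touches a cold page.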
Answer 2:
It depends on your data set and keys. On small data sets your benchmark will be close to the same; on larger data sets it can vary depending on what type of keys you have and how much data. Usually B-tree is better, until the B-tree metadata exceeds your cache and it ends up doing lots of I/O; in that case Hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you find you will be doing lots of inserts and few reads, Hash may be better; if you do few inserts but lots of reads, B-tree may be better.
If you are really concerned about performance you could try both methods and run your own benchmarks =]
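A minimal sketch of such a benchmark with the Perl BerkeleyDB module (the file name, key count, and record shape here are made up for illustration; substitute your real data and access pattern):

#!/usr/bin/perl
# Rough benchmark sketch: time bulk inserts into Btree vs Hash access methods.
use warnings;
use strict;
use BerkeleyDB;
use Time::HiRes qw(time);

for my $class ("BerkeleyDB::Btree", "BerkeleyDB::Hash") {
    unlink "bench.db";
    my %db;
    tie %db, $class, -Filename => "bench.db", -Flags => DB_CREATE
        or die "cannot open bench.db: $BerkeleyDB::Error";
    my $t0 = time;
    $db{"key$_"} = "value$_" for 1 .. 1_000_000;   # insert a million small records
    my $elapsed = time - $t0;
    untie %db;
    printf "%-20s %.1f s\n", $class, $elapsed;
}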
Answer 3:
For many applications, a database is accessed at random, interactively or with "transactions". This might happen if you have data coming in from a web server. However, you often have to populate a large database all at once, as a "batch" operation. This might happen if you are doing a data analysis project, or migrating an old database to a new format.
When you are populating a database all at once, a B-Tree or other sorted index is preferable because it allows the batch insertions to be done much more efficiently. This is accomplished by sorting the keys before putting them into the database. Populating a BerkeleyDB database with 10 million entries might take an hour when the entries are unsorted, because every access is a cache miss. But when the entries are sorted, the same procedure might take only ten minutes. The proximity of consecutive keys means you'll be utilizing various caches for almost all of the insertions. Sorting can be done very quickly, so the whole operation could be sped up by several times just by sorting the data before inserting it. With hashtable indexing, because you don't know in advance which keys will end up next to each other, this optimization is not possible.
Update: I decided to provide an actual example. It is based on the following script, "db-test":
#!/usr/bin/perl
# Usage: ./db-test <BerkeleyDB::Btree|BerkeleyDB::Hash> < keyfile
use warnings;
use strict;
use BerkeleyDB;
my %hash;
unlink "test.db";
# The first command-line argument selects the access method (Btree or Hash).
tie %hash, (shift), -Filename=>"test.db", -Flags=>DB_CREATE or die;
# Insert one record per input line, using the line as the key.
while(<>) { $hash{$_}=1; }
untie %hash;
We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)
$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M enw.tab
$ time shuf enw.tab > test-shuf
16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G test.db
You can see that pre-sorting the Btree keys drops the insertion time down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so there's a big net gain - the database creation goes 5x faster. With the Hash format, the creation times are equally slow whether the input is sorted or not. Also note that the database file size is smaller for the sorted insertions; presumably this has to do with tree balancing.
The speedup must be due to some kind of caching, but I'm not sure where. It is likely that we have fewer cache misses in the kernel's page cache with sorted insertions. This would be consistent with the CPU usage numbers - when there is a page cache miss, then the process has to wait while the page is retrieved from disk, so the CPU usage is lower.
I ran the same tests with two smaller files as well, for comparison.
File       |  WP index          |  Wikt. words        |  /usr/share/dict/words
Entries    |  16e6              |  4.7e6              |  1.2e5
Size       |  700M              |  65M                |  1.1M
shuf time  |  34s               |  4s                 |  0.06s
sort time  |  1:10s             |  6s                 |  0.12s
------------------------------------------------------------------------------
           |  (total time, DB size, CPU usage)
------------------------------------------------------------------------------
Btree shuf |  41m, 1.3G, 42%    |  5:00s, 180M, 88%   |  6.4s, 3.9M, 86%
      sort |   7m, 920M, 91%    |  1:50s, 120M, 99%   |  2.9s, 2.6M, 97%
Hash  shuf |  44m, 1.1G, 39%    |  5:30s, 129M, 87%   |  6.2s, 2.4M, 98%
      sort |  47m, 1.1G, 36%    |  5:30s, 129M, 86%   |  6.2s, 2.4M, 94%
------------------------------------------------------------------------------
Speedup    |  5x                |  2.7x               |  2.2x
With the largest dataset, sorted insertions give us a 5x speedup. With the smallest, we still get a 2x speedup - even though the data fits easily into memory, so that CPU usage is always high. This seems to imply that we are benefiting from another source of efficiency in addition to the page cache, and that the 5x speedup was actually due in equal parts to page cache and something else - perhaps the better tree balancing?
In any case, I tend to prefer the Btree format because it allows faster batch operations. Even if the final database is accessed at random, I use batch operations for development, testing, and maintenance. Life is easier if I can find a way to speed these up.
Answer 4:
To quote the two main authors of Berkeley DB in this write-up of the architecture:
The main difference between Btree and Hash access methods is that Btree offers locality of reference for keys, while Hash does not. This implies that Btree is the right access method for almost all data sets; however, the Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.
So perhaps in embedded devices and specialized use cases a hash table may work. B-tree is used in modern filesystems like Btrfs, and it is pretty much the ideal data structure for building either databases or filesystems.
Source: https://stackoverflow.com/questions/4129326/berkeleydb-b-tree-versus-hash-table