I am looking to write a Key/value store (probably in python) mostly just for experience, and because it\'s something I think that is a very useful product. I have a couple of qu
Amazon released a document about Dynamo - a highly available key-value storage system. It mostly deals with the scaling issues (how to create a key/value store that runs on a large number of machines), but it also deals with some basics, and generally worth to read.
First up, I know this question quite old.
I'm the creator of aodbm ( http://sf.net/projects/aodbm/ ), which is a key-value store library. aodbm uses immutable B+Trees to store your data. So whenever a modification is made a new tree is appended to the end of the file. This probably sounds like a horrendous waste of space, but seeing as the vast majority of nodes from the previous tree are referenced, the overhead is actually quite low. Very little of an entire tree is kept in memory at any given time (at most O(log n)).
I recommend to see the talk Write optimization in external - memory data structures (slides), which gives a nice overview of modern approaches to building extra-memory databases (e. g. key-value stores), and explains log-structured merge trees.
If your key-value store targets use-cases when all the data fits main memory, the data store architecture could be a lot simpler, mapping a file to a big chunk of memory and working with that memory without bothering about disk-to-memory communication and synchronization at all, because it becomes a concern of the operating system.
If you're doing a key/value store in Python for learning purposes, it might be easiest to start with the pickle module. It's a fast and convenient way to write an arbitrary Python data stream to a persistent store and read it back again.
It all depends on the level of complexity you want to dive into. Starting with a simple Python dict
serialized to a file in a myriad of possible ways (of which pickle is probably the simplest), you can go as far as implementing a complete database system.
Look up redis
- it's a key/value store written in C and operating as a server "DB". It has some good documentation and easy to read code, so you can borrow ideas for your Python implementation.
To go even farther, you can read about B-trees.
For your specific questions: above some DB size, you can never keep it all in memory, so you need some robust way of loading data from disk. Also consider whether the store is single-client or multi-client. This has serious consequences for its implementation.
you can have a look at 'Berkley db' to see how how it works, it is a key/value DB, so you can use it directly, or as it is open-source see how it handles persistence, transactions and paging of most referred pages.
here are python bindings to it http://www.jcea.es/programacion/pybsddb.htm