Question
Originally, I had to deal with just 1.5 TB of data. Since I only needed fast writes and reads (without any SQL), I designed my own flat binary file format (implemented in Python) and easily (and happily) saved my data and manipulated it on one machine. Of course, for backup purposes, I added 2 machines to be used as exact mirrors (using rsync).
Now my needs are growing, and I need to build a solution that scales to 20 TB (and beyond). I would be happy to continue using my flat file format for storage. It is fast, reliable and gives me everything I need.
What I am concerned about is replication, data consistency, etc. across the network (obviously the data will have to be distributed, since not all of it can be stored on one machine).
Are there any ready-made solutions (Linux / Python based) that would allow me to keep using my file format for storage, yet would handle the other components that NoSQL solutions normally provide (data consistency, availability, easy replication)?
Basically, all I want is to make sure that my binary files are consistent throughout my network. I am using a network of 60 Core Duo machines (each with 1 GB RAM and 1.5 TB disk).
Answer 1:
Approach: distributed MapReduce in Python with the Disco Project
This seems like a good way of approaching your problem. I have used the Disco Project for similar problems.
You can distribute your files among n machines (processes) and implement the map and reduce functions that fit your logic.
The Disco Project tutorial describes how to implement this kind of solution. You'll be impressed by how little code you need to write, and you can definitely keep your binary file format.
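To give a rough idea of what such map and reduce functions could look like over a custom binary format, here is a minimal single-process sketch using only the Python standard library. It does not use the Disco API. The record layout (a 4-byte key followed by an 8-byte float) and the names `RECORD`, `map_chunk`, `reduce_pairs` and `run_job` are assumptions made up for illustration; a framework such as Disco would run the map step on many machines rather than in one loop.

```python
import struct
from collections import defaultdict

# Hypothetical record layout (an assumption, not the asker's real format):
# little-endian 4-byte unsigned key followed by an 8-byte double.
RECORD = struct.Struct("<Id")


def read_records(blob):
    """Yield (key, value) tuples from a bytes object of packed records."""
    for offset in range(0, len(blob) - RECORD.size + 1, RECORD.size):
        yield RECORD.unpack_from(blob, offset)


def map_chunk(blob):
    """Map step: emit one (key, value) pair per record in this chunk."""
    for key, value in read_records(blob):
        yield key, value


def reduce_pairs(pairs):
    """Reduce step: sum the values seen for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(chunks):
    """Map every chunk, then reduce all emitted pairs.

    Here the chunks are processed sequentially; in a distributed
    framework each chunk would live on, and be mapped by, a different node.
    """
    pairs = (pair for blob in chunks for pair in map_chunk(blob))
    return reduce_pairs(pairs)


if __name__ == "__main__":
    # Two in-memory "files" standing in for chunks stored on two machines.
    chunk_a = RECORD.pack(1, 2.0) + RECORD.pack(2, 3.0)
    chunk_b = RECORD.pack(1, 5.0)
    print(run_job([chunk_a, chunk_b]))
```

The point of the sketch is that the binary format only has to be understood by the record reader; the map and reduce logic never needs the data converted to another storage format.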
Another similar option is to use Amazon's Elastic MapReduce.
Answer 2:
Perhaps some of the commentary on the Kivaloo system developed for Tarsnap will help you decide what's most appropriate: http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
Without knowing more about your application (size/type of records, frequency of reading/writing) or your custom format, it's hard to say more.
Source: https://stackoverflow.com/questions/5560523/nosql-with-my-own-custom-binary-files