I am considering a proof of concept for handling large volumes of data (more than 10 GB) that requires at least 200 writes per second and about 50 reads per second of spatial data. The system is also growing. I am currently considering moving this large volume of data into a NoSQL, Bigtable-style database for performance reasons.
I have taken a closer look at MongoDB and Cassandra. As far as my reading goes:
MongoDB:
- Seems to have a writer lock problem
- One of the posts on Stack Overflow suggested this db if there is no need for multiple servers
- Indexes are kept in memory, so performance is said to deteriorate as the indexes grow
- The advantage is that MongoDB has direct support for spatial data and indexing, along with features like finding nearby locations
- I see the post "Cassandra Or MongoDB For Our Location Based Application" suggesting MongoDB as the best choice
Cassandra:
- Seems to be the best among the related dbs
- Seems to have great write as well as read performance
- Does not natively support spatial indexing but this can be extended via geohashing
My heart actually goes out to MongoDB because of its good documentation and direct support for spatial data. Has anybody had a bad experience using MongoDB for such big systems? I actually see a lot of posts about iostat and MongoDB performance.
If MongoDB is not suited, can someone give some pointers on geohashing with Cassandra? I saw http://code.google.com/p/geospatialweb/ for creating the hashes, but I still have questions about how to query them.
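For readers unfamiliar with geohashing, here is a minimal Python sketch of the standard geohash encoding (alternating longitude/latitude bisection, 5 bits per base-32 character). The function name `geohash_encode` is hypothetical and this is not the code from the linked project. The useful property is that nearby points usually share a hash prefix, so one possible approach (an assumption, not something confirmed in this thread) is to store a short prefix as a lookup key and query by prefix.

```python
# Standard geohash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string.

    Hypothetical helper for illustration: bits alternate between
    longitude and latitude, and every 5 bits become one character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    # A longer hash of the same point extends the shorter one.
    print(geohash_encode(57.64911, 10.40744, 3))  # u4p
    print(geohash_encode(57.64911, 10.40744, 9))
```

Because a longer hash of a point always starts with the shorter hash of the same point, a "nearby" query can be approximated by matching on a prefix of a chosen length. Points close to a cell boundary can fall into different cells, so a real query would typically also check the neighbouring cells.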
I realize this is an older question, and I know this doesn't directly answer it, but depending on your queries, Cassandra may not be the best option, and getting your queries to work with indexing in MongoDB can be problematic as well (in my own experience). IMHO, Mongo has a slight edge over Cassandra for heavy geo data and queries.
I'd also suggest looking into ElasticSearch, which, depending on your data shape and the types of queries you'll be making, is probably the best solution. When you posted your question it was likely less of an option than it is today, though.
Try Cassandra + Solr. This might be useful: http://digbigdata.com/geospatial-search-cassandra-datastax-enterprise/
Regards, Goutham Kumar
tl;dr
Elassandra, a combination of Cassandra and ElasticSearch.
A little update from the future.
I'm currently working on a concept for a big data real-time system and also need to store geospatial data and run queries at scale. Over the last few days I did a lot of research on how to arrange the data properly and support a geospatial index and queries such as bounding-box searches.
The first option I read about was PostgreSQL + PostGIS, but the biggest instance is limited to a maximum of 200k writes/sec.
The second was a geospatial database, Tile38, which can scale queries but not writes. The only way to scale writes with it would be to shard the data manually.
The third was MongoDB, because it has good documentation for the geospatial functionality I need, but it was hard to determine whether writes can be scaled.
The last database was Cassandra, which is well known for horizontal write scaling and failover. The trade-off with Cassandra is that query performance is not good and geospatial queries are not supported out of the box. For querying the data at scale, ElasticSearch is a good solution, as Tracker1 already suggested. Today I found a new database built from Cassandra and ElasticSearch, called Elassandra, which allows writes at scale and also reads at scale in near real time. So far it is the best solution for me, with minimal effort for setup and maintenance.
We also use Cassandra at the moment and are looking for a spatial index solution. We are going with Lucene to provide full-text and attribute search, and support for spatial indexing comes along with it. Maybe you want to check this out, too.
Our current implementation shards the information based on a simple grid-based tree. Each shard is a Lucene index, and once it grows over a certain size the index is split along either x or y. Since such a shard has a binary representation (a position in the grid consists of two bits, the next level adds the next two bits, and so on), a search is issued by position and is answered by any shard whose key is a prefix of the position at the given grid resolution. This simple system works well so far but is not yet in production use.
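To illustrate the prefix idea described above, here is a rough Python sketch. It is not our actual implementation: the names `grid_key` and `find_shards` are hypothetical, coordinates are assumed to be normalized to [0, 1), and each level always splits both x and y (two bits per level), which simplifies the "split by either x or y" behaviour.

```python
def grid_key(x, y, levels):
    """Return a bit string with two bits (x bit, y bit) per grid level.

    Assumes x and y are normalized to the range [0, 1).
    """
    x_lo, x_hi = 0.0, 1.0
    y_lo, y_hi = 0.0, 1.0
    parts = []
    for _ in range(levels):
        x_mid = (x_lo + x_hi) / 2
        y_mid = (y_lo + y_hi) / 2
        # One bit per axis: 1 means the upper half of the current cell.
        if x >= x_mid:
            x_bit, x_lo = "1", x_mid
        else:
            x_bit, x_hi = "0", x_mid
        if y >= y_mid:
            y_bit, y_lo = "1", y_mid
        else:
            y_bit, y_hi = "0", y_mid
        parts.append(x_bit + y_bit)
    return "".join(parts)


def find_shards(shard_keys, position_key):
    """Return the shards whose key is a prefix of the position key."""
    return [key for key in shard_keys if position_key.startswith(key)]


if __name__ == "__main__":
    position = grid_key(0.75, 0.25, levels=2)
    # The empty key stands for an unsplit root shard covering everything.
    shards = ["", "10", "1011", "01"]
    print(position, find_shards(shards, position))
```

A shard with a short key covers a large cell and a shard with a longer key covers a smaller cell inside it, so a single `startswith` check is enough to decide which shards can answer a search at a given position.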
Source: https://stackoverflow.com/questions/7903712/spatial-data-with-mongodb-or-cassandra