Question
I have a crowdsourcing application. Data from users is collected, processed, and then published for everyone to see. Data collection happens in near real time, and the processing load grows as the number of users (data nodes) grows. I need to scale this.
Looking at scaling for graph-based models, MapReduce seems to be the most popular approach. Is there a benchmarking paper comparing it to other techniques? Pregel also looks impressive. Please point me to any leads on partitioning in Pregel, i.e., how a graph can be partitioned intelligently so that workers do not lag behind one another.
Answer 1:
The problem of partitioning a graph "intelligently" in order to minimize execution time is an interesting one. However, it is not simple, and the best strategy depends on your data and your algorithm. You may also find that, in practice, it is not necessary and a random partitioning is good enough.
For example, if you are interested in exploring Pregel-like approaches, you can have a look at Apache Giraph and experiment with different partitioning techniques.
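For reference, the Pregel paper's default assignment is exactly this random/hash scheme: vertex v goes to partition hash(v) mod N, where N is the number of partitions. Below is a minimal, self-contained Java sketch of that default; the class and method names here are illustrative and are not Giraph's actual partitioning API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of Pregel's default partitioning scheme:
 * a vertex with id v is assigned to worker hash(v) mod numWorkers.
 * Class and method names are illustrative, not a real Giraph API.
 */
public class HashPartitioner {
    private final int numWorkers;

    public HashPartitioner(int numWorkers) {
        this.numWorkers = numWorkers;
    }

    /** Worker index for a vertex id; Math.floorMod avoids negative results. */
    public int assign(long vertexId) {
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    /** Group vertex ids into per-worker buckets. */
    public List<List<Long>> partition(long[] vertexIds) {
        List<List<Long>> buckets = new ArrayList<>(numWorkers);
        for (int i = 0; i < numWorkers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long v : vertexIds) {
            buckets.get(assign(v)).add(v);
        }
        return buckets;
    }

    public static void main(String[] args) {
        HashPartitioner p = new HashPartitioner(4);
        List<List<Long>> buckets = p.partition(new long[]{1, 2, 3, 42, 100, 7});
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("worker " + i + ": " + buckets.get(i));
        }
    }
}
```

A "smarter" partitioner would replace assign() with something locality-aware, e.g. trying to keep densely connected vertices on the same worker to reduce cross-worker messages, but as noted above, whether that beats random hashing depends on your data and your algorithm.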
Source: https://stackoverflow.com/questions/9583296/how-to-partition-graph-for-pregel-to-maximize-processing-speed