Realtime querying/aggregating millions of records - Hadoop? HBase? Cassandra?

不思量自难忘° 2021-01-31 06:30

I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop/NoSQL, and I'm not sure which solution is best for my needs. In theory, if I had unl

5 answers
  •  温柔的废话
    2021-01-31 07:10

    Neither Hive nor Pig seems likely to help you. Each of them compiles down to one or more MapReduce jobs, so a response within 5 seconds is not realistic.

    HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. Look into computing running averages, so that each query combines small pre-computed summaries instead of running a heavyweight reduce over the raw records.

    Check out http://en.wikipedia.org/wiki/Standard_deviation

    stddev(X) = sqrt(E[X^2]- (E[X])^2)

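    This implies that you can get the stddev of AB (the records of A and B combined) from per-partition sums alone:

    stddev(AB) = sqrt(E[AB^2] - (E[AB])^2)

    where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).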
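
As a rough illustration of the identity above, here is a minimal Python sketch of a mergeable summary that stores only a count, a sum, and a sum of squares per partition. The `Summary` class and its method names are hypothetical, invented for this example; they are not part of HBase, Hive, or any other library mentioned here. It computes the population standard deviation, which is what the formula describes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical mergeable summary of one partition of a column."""
    n: int = 0             # |A|, number of records
    total: float = 0.0     # sum(A)
    total_sq: float = 0.0  # sum(A^2), sum of squared values

    @classmethod
    def of(cls, values):
        # Pre-compute the three sums once, e.g. at load time.
        values = list(values)
        return cls(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Combining two partitions only needs component-wise addition.
        return Summary(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population stddev: sqrt(E[X^2] - (E[X])^2).
        variance = self.total_sq / self.n - self.mean() ** 2
        # Guard against a tiny negative value caused by rounding.
        return math.sqrt(max(variance, 0.0))


# Two partitions summarized independently, then merged at query time.
a = Summary.of([1.0, 2.0, 3.0])
b = Summary.of([4.0, 5.0])
ab = a.merge(b)
print(ab.mean())    # 3.0
print(ab.stddev())  # sqrt(2), about 1.414
```

Because `merge` is just addition, the summaries can be combined in any order and on any node. One caveat: the sum-of-squares form can lose precision when the mean is large relative to the spread, so a numerically stabler update may be worth considering for such data.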
