I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop/NoSQL, and I'm not sure which solution is best for my needs. In theory, if I had unlimited…
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably get away with just one or two large tables. SQL has problems when you need to join over large data sets, but since it doesn't sound like you need joins, you should be fine. You can set up indexes to find the data set and then do the math either in SQL or in your application.
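A minimal sketch of that approach, using SQLite as a stand-in for whatever database you pick; the table and column names (measurements, type_id, col_a) are made up for illustration:

```python
import sqlite3

# Hypothetical schema: one wide table of numeric columns plus a type_id filter field.
conn = sqlite3.connect("measurements.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        key     TEXT PRIMARY KEY,
        type_id INTEGER,
        col_a   REAL,
        col_b   REAL
    )
""")
# Index on the filter field so the WHERE clause below doesn't scan the whole table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_type ON measurements (type_id)")

# Do the math in SQL: count, mean, and the sum of squares needed for stddev in one pass.
n, mean, sum_sq = conn.execute("""
    SELECT COUNT(col_a), AVG(col_a), SUM(col_a * col_a)
    FROM measurements
    WHERE type_id = ?
""", (42,)).fetchone()

if n:
    variance = sum_sq / n - mean * mean   # E[X^2] - (E[X])^2
    print(n, mean, variance ** 0.5)
```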
Hive and Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduce steps.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the pooled set AB (the values of A and B combined) by doing
stddev(AB) = sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] = (sum(A) + sum(B)) / (|A| + |B|). So per column you only need to keep a count, a running sum, and a running sum of squares.
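A small sketch of that running-sums idea, assuming each partition of the data pre-computes a (count, sum, sum-of-squares) triple per column; merging the triples gives the pooled mean and stddev without re-reading the raw values:

```python
import math

# Per-chunk aggregates for one column: (count, sum, sum of squares).
def summarize(values):
    return (len(values), sum(values), sum(v * v for v in values))

# Merging is element-wise addition, so it parallelizes trivially.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_stddev(agg):
    n, s, sq = agg
    mean = s / n
    # stddev(X) = sqrt(E[X^2] - (E[X])^2)
    return mean, math.sqrt(sq / n - mean * mean)

# Two pre-computed chunks A and B give the stats of their union without touching the data again.
agg_a = summarize([1.0, 2.0, 3.0])
agg_b = summarize([4.0, 5.0])
print(mean_stddev(merge(agg_a, agg_b)))
```

Note that the E[X^2] - (E[X])^2 form can lose precision when the mean is large relative to the spread; a Welford-style merge is an option if that matters for your data.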
If I understand you correctly and you only need to aggregate over single columns at a time, you can store your data differently in HBase for better results. The layout would look something like:

- a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
- a row for each key in today's setup (you may want to think about how to incorporate your filter fields into the row key for efficient filtering; otherwise you'd have to do a two-phase read)
- a column for each table in today's setup (i.e. a few thousand columns)

HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist. When you read a row you'd get all the relevant values, from which you can compute the average etc. quite easily.
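As a rough sketch of what reading such a layout could look like, here is an aggregation pass using the happybase Python client (this assumes an HBase Thrift gateway; the table name col_a and the type_id-prefixed row keys are hypothetical choices, not anything from the question):

```python
import happybase

# Hypothetical layout from the answer above: one HBase table per logical data column,
# row key = "<type_id>:<original key>" so a prefix scan does the filtering.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("col_a")

total, count = 0.0, 0
# Prefix scan over one type_id; each row holds the values for one original key.
for row_key, columns in table.scan(row_prefix=b"42:"):
    for value in columns.values():
        total += float(value)
        count += 1

if count:
    print("avg:", total / count)
```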
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery: you can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you design your application, you could create a fairly 'real-time' application.
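For illustration, a small query sketch using the google-cloud-bigquery Python client, which wraps that REST API; the project/dataset/table names and the type_id filter are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate one column, filtered by a type_id parameter; names are hypothetical.
query = """
    SELECT COUNT(col_a)  AS n,
           AVG(col_a)    AS mean,
           STDDEV(col_a) AS sd
    FROM `my_project.my_dataset.measurements`
    WHERE type_id = @type_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("type_id", "INT64", 42)]
)

for row in client.query(query, job_config=job_config).result():
    print(row.n, row.mean, row.sd)
```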
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do. Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time... Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
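A back-of-envelope sketch of that sizing exercise; every number below is a placeholder to be replaced with your actual row/column counts and node RAM:

```python
import math

# Rough sizing for the "hold the whole dataset in memory" idea.
rows       = 50_000_000   # hypothetical row count
columns    = 3_000        # "a few thousand" columns, as mentioned above
value_size = 8            # bytes per value (one double)
overhead   = 1.5          # guess for indexes / runtime / replication overhead

dataset_bytes = rows * columns * value_size * overhead
node_ram_gb   = 64        # usable RAM per node, a guess

print("dataset: %.1f TB" % (dataset_bytes / 1e12))
print("nodes needed: %d" % math.ceil(dataset_bytes / (node_ram_gb * 1e9)))
```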