Streaming data and Hadoop? (not Hadoop Streaming)

前端未结


I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop


10 answers

耶瑟儿～

2021-01-30 12:25

You should try Apache Spark Streaming. It should work well for your purposes.

心在旅途

2021-01-30 12:28

Twitter's Storm is what you need; give it a try!

星月不相逢

2021-01-30 12:33

What about S4 (http://s4.io/)? It's made for processing streaming data.

Update

A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

没有蜡笔的小新

2021-01-30 12:35

Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.

If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
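As a rough illustration of the threaded approach (this is not Bixo's actual FetcherBuffer code, which is Java running on Cascading), a reduce-side step can hand its group of URLs to a thread pool so that slow sockets overlap instead of blocking one another. The Python sketch below uses hypothetical names throughout (`fetch_group`, `fetch`, `max_threads`); the fetch function is injected so the threading logic does not depend on any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_group(urls, fetch, max_threads=8):
    """Fetch every URL in one reduce group concurrently.

    `fetch` is any callable that takes a URL and returns its body. It is
    injected here (an assumption of this sketch) so that the threading
    logic stays independent of the HTTP client.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Remember which future belongs to which URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # A failing host should not sink the rest of the group,
                # so the exception is stored in place of the body.
                results[url] = exc
    return results
```

In a real job, `fetch` might wrap something like `urllib.request.urlopen(url).read()` with a timeout; the point is only that the waiting happens in parallel threads rather than serially inside the task.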

说谎

2021-01-30 12:37

I think you should take a look at Esper CEP (http://esper.codehaus.org/).

别跟我提以往

2021-01-30 12:40

The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
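To make the batch constraint concrete: a reduce step needs every value for a key, so an endless stream has to be cut into finite batches before anything can be reduced. Here is a minimal Python sketch of that idea. It is plain in-memory code, not the Hadoop API, and `micro_batches` and `count_words` are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a possibly endless iterator into finite lists of records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream ended; an endless stream never gets here
        yield batch


def count_words(batch):
    """Map each line to words, then reduce by key.

    The counts are only final because the batch is finite, which is
    the property an unbounded stream lacks.
    """
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return dict(counts)
```

Each batch yields its own result, so totals across batches still have to be merged in a later step.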

Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
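The watchdog's core decision, whether enough rotated files have piled up to justify a job, could look roughly like this in Python. Everything here is an assumption for illustration: `pick_batch` is a hypothetical helper, a local directory stands in for the HDFS dumping ground, and the set of processed names would probably live somewhere shared (ZooKeeper, for example) in a distributed setup.

```python
import os


def pick_batch(dump_dir, already_processed, min_bytes):
    """Return rotated files to process, or [] if too little has accumulated.

    `dump_dir` stands in for the HDFS dumping ground; a local directory is
    used only so that the sketch runs without a cluster.
    `already_processed` is a set of file names handled by earlier jobs.
    """
    pending = []
    total = 0
    for name in sorted(os.listdir(dump_dir)):
        path = os.path.join(dump_dir, name)
        if name in already_processed or not os.path.isfile(path):
            continue
        pending.append(path)
        total += os.path.getsize(path)
    # Only launch a job when the input is large enough to justify its overhead.
    return pending if total >= min_bytes else []
```

The watchdog would call this periodically, submit a job over the returned paths when the list is non-empty, and then record those file names as processed.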

HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
