Streaming data and Hadoop? (not Hadoop Streaming)

别跟我提以往 2021-01-30 11:55

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop

10 Answers
  • 2021-01-30 12:43

    There are multiple options here. I suggest combining Kafka and Storm with either Hadoop or a NoSQL store. We have already built our big data platform using these open-source tools, and it works very well.

  • 2021-01-30 12:43

    Several mature stream processing frameworks and products are available on the market. Open-source frameworks include Apache Storm and Apache Spark (both of which can run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.

    Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this is complementary to Hadoop.

    By the way, many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data", because you have to act in real time rather than process in batches.

  • 2021-01-30 12:46

    Yahoo S4 http://s4.io/

    It provides real-time stream computing, similar to MapReduce.

  • 2021-01-30 12:46

    As you know, there are two main issues with using Hadoop for stream mining. First, it uses HDFS, which is disk-based, and disk operations add latency that can cause data in the stream to be missed. Second, the pipeline is not parallel: MapReduce generally operates on batches of data, not on individual instances as stream data requires.

    I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, it uses incremental learners, which no longer run in batch. It is worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API for M3 anywhere; if somebody finds it, please share the link here.
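To make the batch-versus-instance distinction above concrete, here is a minimal sketch in plain Python. It is not M3's or Hadoop's API (neither is used here); the function names and the `fake_stream` generator are hypothetical, standing in for records arriving over HTTP. The batch version needs the whole input before it can answer, while the streaming version keeps state in memory and updates it one record at a time.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_word_count(records: list[str]) -> Counter:
    """Batch style: the complete input must exist before any result is produced."""
    # "Map" phase: emit (word, 1) pairs for every record.
    mapped = [(word, 1) for line in records for word in line.split()]
    # "Reduce" phase: sum the ones per word.
    counts: Counter = Counter()
    for word, one in mapped:
        counts[word] += one
    return counts


def streaming_word_count(stream: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update in-memory state per record and emit a snapshot."""
    counts: Counter = Counter()
    for line in stream:
        counts.update(line.split())
        yield Counter(counts)  # copy, so earlier snapshots are not mutated


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for an unbounded source of records."""
    yield "a b a"
    yield "b c"
    yield "a"


snapshots = list(streaming_word_count(fake_stream()))
final = batch_word_count(list(fake_stream()))
```

On a finite input both approaches agree on the final counts; the difference is that the streaming version has a usable answer after every record and never has to store the input.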

    Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online

    Apache Storm is the key solution to the issue, but it is not enough on its own. You still need some equivalent of MapReduce, which is why you need a library called SAMOA; it has great algorithms for online learning that Mahout largely lacks.
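As a rough illustration of what "incremental" or "online" means here, the sketch below maintains a running mean and variance one instance at a time using Welford's algorithm. This is not SAMOA's API and not one of its learners; the `RunningStats` class is a hypothetical name chosen for the example. It only shows the general shape: constant memory, one update per arriving instance, no pass over stored batches.

```python
class RunningStats:
    """Running mean and sample variance, updated one instance at a time (Welford)."""

    def __init__(self) -> None:
        self.n = 0        # number of instances seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new instance into the statistics without storing it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; defined as 0.0 until at least two instances are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a stream
    stats.update(value)
```

A real online learner (a classifier or regressor) follows the same pattern, with `update` adjusting model parameters instead of summary statistics.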
