Streaming data and Hadoop? (not Hadoop Streaming)

别跟我提以往 2021-01-30 11:55

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop

10 Answers
  • 2021-01-30 12:25

    You should try Apache Spark Streaming. It should work well for your purposes.

  • 2021-01-30 12:28

    Twitter's Storm is what you need. Give it a try!

  • 2021-01-30 12:33

    What about http://s4.io/? It's made for processing streaming data.

    Update

    A new product is on the rise: Storm, a distributed and fault-tolerant realtime computation system for stream processing, continuous computation, distributed RPC, and more.

  • 2021-01-30 12:35

    Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.

    If so, then see "Why fetching web pages doesn't map well to map-reduce". You might also want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
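To illustrate the general idea of a threaded approach to slow HTTP sources, here is a minimal Python sketch. It is not Bixo's FetcherBuffer and makes no claim about how that class works; `fetch_all` and the injected `fetch` callable are hypothetical names chosen for the example. The point is only that many slow fetches can overlap inside a single task instead of each one blocking in turn.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it, so one slow or failing source does not
    abort the whole batch. `fetch` is a caller-supplied callable
    (hypothetical here); duplicate URLs collapse into one entry.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the fetches overlap in time.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error next to its URL
                results[url] = exc
    return results
```

In practice `fetch` would wrap whatever HTTP client you already use, ideally with a timeout, and `max_workers` would be tuned to how many open connections the remote side tolerates.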

  • 2021-01-30 12:37

    I think you should take a look at Esper CEP (http://esper.codehaus.org/).

  • 2021-01-30 12:40

    The hack you describe is more or less the standard way to do things: Hadoop is fundamentally a batch-oriented system. For one thing, if there is no end to the data, reducers can never start, because the reduce phase can only begin once the map phase has finished.
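The batch constraint can be shown with a toy example in plain Python (no Hadoop involved). This is a sketch under assumptions: the word-count job, the function names, and the batch size are all illustrative. The reduce step needs every pair from its batch before it can return, so the unbounded stream is first cut into finite batches and each batch is processed as its own small job.

```python
from collections import defaultdict
from itertools import islice


def map_phase(records):
    """Emit a (token, 1) pair for each whitespace-separated token."""
    for record in records:
        for token in record.split():
            yield token, 1


def reduce_phase(pairs):
    """Sum the counts per key; consumes all pairs of one batch."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def batches(stream, size):
    """Cut a possibly unbounded iterable into lists of at most `size` records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_micro_batches(stream, size):
    """Yield one reduced result per finite batch of the stream."""
    for batch in batches(stream, size):
        yield reduce_phase(map_phase(batch))
```

Because `run_micro_batches` is a generator, it can be fed an endless source and still produce a result each time a batch fills up; the per-batch results would then need to be merged downstream if a global total is wanted.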

    Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
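A single polling step of such a watchdog might look like the sketch below. Everything here is an assumption made for illustration: a local directory stands in for the HDFS dumping ground, `submit_job` is a hypothetical callable that would launch the real processing job, and `min_bytes` is the "large enough" threshold. Coordination across machines (for example via ZooKeeper) is left out.

```python
import os


def pending_files(landing_dir):
    """Return (path, size_in_bytes) for each regular file, sorted by path."""
    entries = [e for e in os.scandir(landing_dir) if e.is_file()]
    return sorted((e.path, e.stat().st_size) for e in entries)


def select_batch(files, min_bytes):
    """Return every pending path once their total size reaches min_bytes.

    Returns an empty list while the input is still too small to be
    worth the overhead of starting a job.
    """
    total = sum(size for _, size in files)
    if total < min_bytes:
        return []
    return [path for path, _ in files]


def poll_once(landing_dir, min_bytes, submit_job):
    """Check the landing directory once and submit a job if warranted.

    `submit_job` is a hypothetical hook; it receives the list of paths.
    """
    batch = select_batch(pending_files(landing_dir), min_bytes)
    if batch:
        submit_job(batch)
    return batch
```

A real watchdog would call `poll_once` on a timer and move or mark the submitted files so that the next poll does not pick them up again.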

    HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
