Amazon MapReduce best practices for logs analysis

太阳男子 2021-02-08 20:16

I'm parsing access logs generated by Apache, Nginx, and Darwin (video streaming server), and aggregating statistics for each delivered file by date / referrer / useragent.
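
To make the aggregation concrete, here is a minimal sketch of that step. It assumes the Apache/Nginx "combined" log format; the regular expression and the `aggregate` function are illustrative names, not from any library, and Darwin logs would need their own pattern.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed input: Apache/Nginx "combined" log format.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def aggregate(lines):
    """Count hits per (file, date, referrer, user agent); skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = COMBINED.match(line)
        if m is None:
            continue
        # %b is locale-dependent; this assumes English month abbreviations.
        day = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").date().isoformat()
        counts[(m["path"], day, m["referrer"], m["agent"])] += 1
    return counts
```

In a MapReduce job the same key tuple would be what the mapper emits and the reducer sums; the `Counter` here just stands in for that on a single machine.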

1 answer
  • 2021-02-08 20:34

    That's a very open-ended question, but here are some thoughts to consider:

    • Amazon SQS: a distributed queue that is very useful for workflow management. One process writes a message to the queue as soon as a log is available; another reads the message, processes the log it describes, and deletes the message once processing is done. This helps ensure each log is processed only once (standard SQS queues deliver at least once, so the processing step should tolerate an occasional duplicate).
    • Apache Flume, as you mentioned, is very useful for log aggregation. It is worth considering even if you don't need real-time processing, since at the very least it gives you a standardized aggregation process.
    • Amazon recently released Simple Workflow (SWF). I have only just started looking into it, but it sounds promising for managing every step of your data pipeline.
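
The SQS idea above could look roughly like the sketch below. It assumes `sqs` behaves like a `boto3.client("sqs")` object (only `send_message`, `receive_message` and `delete_message` are used); `announce_log`, `drain_queue`, `process_log` and the message body format are hypothetical names chosen for illustration.

```python
def announce_log(sqs, queue_url, log_location):
    """Producer side: enqueue a message as soon as a log is available."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=log_location)


def drain_queue(sqs, queue_url, process_log, max_batches=1):
    """Consumer side: process each announced log, delete its message on success.

    `process_log` is a hypothetical callback that receives the message body
    (for example, the location of the log file).
    """
    handled = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            try:
                process_log(msg["Body"])
            except Exception:
                # Not deleted: the message becomes visible again after the
                # queue's visibility timeout and can be retried.
                continue
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            handled += 1
    return handled
```

Deleting only after `process_log` returns is what keeps a failed log from being lost; the trade-off is that a worker crashing mid-way leads to a retry, so the processing should be safe to repeat.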

    Hope that gives you some clues.
