real time log processing using apache spark streaming

后端 未结 3 550
情书的邮戳
情书的邮戳 2021-02-06 02:05

I want to create a system where I can read logs in real time, and use apache spark to process it. I am confused if I should use something like kafka or flume to pass the logs to

相关标签:
3条回答
  • 2021-02-06 02:42

    You can use Apache Kafka as queue system for your logs. The system that generated your logs e.g websever will send logs to Apache KAFKA. Then you can use apache storm or spark streaming library to read from KAFKA topic and process logs at real time.

    You need to create stream of logs , which you can create using Apache Kakfa. There are integration available for kafka with storm and apache spark. both has its pros and cons.

    For Storm Kafka Integration look here

    For Apache Spark Kafka Integration take a look here

    0 讨论(0)
  • 2021-02-06 02:52

    Apache Flume may help to read the logs in real time. Flume provides logs collection and transport to the application where Spark Streaming is used to analyze required information.

    1. Download Apache Flume from official site or follow the instructions from here

    2. Setup and run Flume modify flume-conf.properties.template from the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf), here you need to provide logs source, channel and sinks (output). More details about setup here

    There is an example of launching flume which collects log information from ping comand running on windows host and writes it to a file:

    flume-conf.properties

    agent.sources = seqGenSrc
    agent.channels = memoryChannel
    agent.sinks = loggerSink
    
    agent.sources.seqGenSrc.type = exec
    agent.sources.seqGenSrc.shell = powershell -Command
    
    agent.sources.seqGenSrc.command = for() { ping google.com }
    
    agent.sources.seqGenSrc.channels = memoryChannel
    
    agent.sinks.loggerSink.type = file_roll
    
    agent.sinks.loggerSink.channel = memoryChannel
    agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
    agent.sinks.loggerSink.serializer = text
    agent.sinks.loggerSink.appendNewline = false
    agent.sinks.loggerSink.rollInterval = 0
    
    agent.channels.memoryChannel.type = memory
    agent.channels.memoryChannel.capacity = 100
    

    To run the example go to FLUME_INSTALLATION_PATH and execute

    java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent
    

    OR you may create your java application that has flume libraries in a classpath and call org.apache.flume.node.Application instance from the application passing corresponding arguments.

    How to setup Flume to collect and transport logs?

    You can use some script for gathering logs from the specified location

    agent.sources.seqGenSrc.shell = powershell -Command
    agent.sources.seqGenSrc.command = your script here
    

    instead of windows script you also can launch java application (put 'java path_to_main_class arguments' in field) which provides smart logs collection. For example, if the file is modified in real-time you can use Tailer from Apache Commons IO. To configure the Flume to transport the log infromation read this article

    3. Get the Flume stream from your source code and analyze it with Spark. Take a look on a code sample from github https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java

    0 讨论(0)
  • 2021-02-06 03:07

    Although this is a old question, posting a link from Databricks, which has a great step by step article for log analysis with Spark considering many areas.

    https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/index.html

    Hope this helps.

    0 讨论(0)
提交回复
热议问题