Flume-ng tail a file

生来不讨喜 2021-01-07 09:10

I am trying to understand how to tail a file with flume-ng so that I can push the data into HDFS. As a first step, I have set up a simple conf file:

tail         


        
4 Answers
  • 2021-01-07 09:28

    Your config file looks fine. I used it on CDH4 and it worked as expected; the only thing I changed was the location of the log file being tailed, and I saw the output on the console. In my case new log data was being written continuously to the file I was tailing. The timestamps in your data suggest that this might not be the case in your example.

    Here is a more complete conf example, closer to what I think you are trying to do. It tails the file and writes a new HDFS file every 10 minutes or every 10K records, whichever comes first. Change agent1.sources.source1.command to your tail command, and change agent1.sinks.sink1.hdfs.path and agent1.sinks.sink1.hdfs.filePrefix to match your HDFS setup.

    # A single-node Flume configuration
    # uses exec and tail and will write a file every 10K records or every 10 min
    # Name the components on this agent
    agent1.sources = source1
    agent1.sinks = sink1
    agent1.channels = channel1
    
    # Describe/configure source1
    agent1.sources.source1.type = exec
    agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune_log.log
    
    # Describe sink1
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://localhost/flume/logtest/
    agent1.sinks.sink1.hdfs.filePrefix = LogCreateTest
    # Number of seconds to wait before rolling current file (0 = never roll based on time interval)
    agent1.sinks.sink1.hdfs.rollInterval = 600
    # File size to trigger roll, in bytes (0: never roll based on file size) 
    agent1.sinks.sink1.hdfs.rollSize = 0
    # Number of events written to the file before it is rolled (0 = never roll based on number of events)
    agent1.sinks.sink1.hdfs.rollCount = 10000
    # Number of events written to the file before it is flushed to HDFS
    agent1.sinks.sink1.hdfs.batchSize = 10000 
    agent1.sinks.sink1.hdfs.txnEventMax = 40000
    # -- Compression codec. one of following : gzip, bzip2, lzo, snappy
    # hdfs.codeC = gzip
    #format: currently SequenceFile, DataStream or CompressedStream
    #(1)DataStream will not compress output file and please don't set codeC
    #(2)CompressedStream requires set hdfs.codeC with an available codeC
    agent1.sinks.sink1.hdfs.fileType = DataStream 
    agent1.sinks.sink1.hdfs.maxOpenFiles=50
    # -- "Text" or "Writable"
    #hdfs.writeFormat
    agent1.sinks.sink1.hdfs.appendTimeout = 10000
    agent1.sinks.sink1.hdfs.callTimeout = 10000
    # Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
    agent1.sinks.sink1.hdfs.threadsPoolSize=100 
    # Number of threads per HDFS sink for scheduling timed file rolling
    agent1.sinks.sink1.hdfs.rollTimerPoolSize = 1 
    # hdfs.kerberosPrincipal (default: none): Kerberos user principal for accessing secure HDFS
    # hdfs.kerberosKeytab (default: none): Kerberos keytab for accessing secure HDFS
    # hdfs.round (default: false): Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
    # hdfs.roundValue (default: 1): Rounded down to the highest multiple of this (in the unit configured using
    # hdfs.roundUnit), less than current time.
    # hdfs.roundUnit (default: second): The unit of the round down value - second, minute or hour.
    # serializer (default: TEXT): Other possible options include AVRO_EVENT or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
    # serializer.*
    
    
    # Use a channel which buffers events to a file
    # -- The component type name, needs to be FILE.
    agent1.channels.channel1.type = FILE
    # checkpointDir (default: ~/.flume/file-channel/checkpoint): The directory where the checkpoint file will be stored
    # dataDirs (default: ~/.flume/file-channel/data): The directory where log files will be stored
    # The maximum size of transaction supported by the channel
    agent1.channels.channel1.transactionCapacity = 1000000
    # Amount of time (in millis) between checkpoints
    agent1.channels.channel1.checkpointInterval = 30000
    # Max size (in bytes) of a single log file
    agent1.channels.channel1.maxFileSize = 2146435071
    # Maximum capacity of the channel
    agent1.channels.channel1.capacity = 10000000
    # keep-alive (default: 3): Amount of time (in sec) to wait for a put operation
    # write-timeout (default: 3): Amount of time (in sec) to wait for a write operation
    
    # Bind the source and sink to the channel
    agent1.sources.source1.channels = channel1
    agent1.sinks.sink1.channel = channel1
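
A minimal sketch for checking the point made above, that the exec source only produces events while new lines keep arriving in the tailed file. The path /tmp/flume_tail_test.log and the function name generate_log_lines are hypothetical and not part of the original answer; the flume-ng command in the comment follows the form given in the Flume user guide and assumes the config above is saved as conf/agent1.conf.

```shell
#!/bin/sh
# Hypothetical test file; point agent1.sources.source1.command at it,
# for example: tail -F /tmp/flume_tail_test.log
LOG_FILE="${LOG_FILE:-/tmp/flume_tail_test.log}"

# Append COUNT timestamped lines to FILE, sleeping DELAY seconds after each one.
generate_log_lines() {
    file="$1"
    count="$2"
    delay="$3"
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s test line %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$i" >> "$file"
        i=$((i + 1))
        sleep "$delay"
    done
}

# In another terminal, start the agent first (assumed file name conf/agent1.conf):
#   bin/flume-ng agent -n agent1 -c conf -f conf/agent1.conf -Dflume.root.logger=INFO,console
generate_log_lines "$LOG_FILE" 5 0
```

With a non-zero delay, for example `generate_log_lines "$LOG_FILE" 1000 1`, the file grows steadily, which is closer to the continuously written log described in this answer.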
    
  • 2021-01-07 09:30

    There are two possible causes. First, once you have issued the command and the source has started, the sink must also register and start. I don't see the two lines showing that in the logs you posted; I hope you haven't just left them out. Normally the output looks something like this:

    apache@hadoop:/hadoop/projects/apache-flume-1.4.0-SNAPSHOT-bin$ bin/flume-ng agent -n agent1 -c /conf -f conf/agent1.conf
    Info: Including Hadoop libraries found via (/hadoop/projects/hadoop-1.0.4/bin/hadoop) for HDFS access
    Warning: $HADOOP_HOME is deprecated.
    
    Warning: $HADOOP_HOME is deprecated.
    
    Info: Excluding /hadoop/projects/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar from classpath
    Info: Excluding /hadoop/projects/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar from classpath
    + exec /usr/lib/jvm/java-7-oracle/bin/java -Xmx20m -cp '/conf:/hadoop/projects/apache-flume-1.4.0-SNAPSHOT-bin/lib/*:/hadoop/projects/hadoop-1.0.4/libexec/../conf:/usr/lib/jvm/java-7-oracle/lib/tools.jar:/hadoop/projects/hadoop-1.0.4/libexec/..:/hadoop/projects/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/guava-13.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/hadoop/projects/hadoop-1.0.
4/libexec/../lib/hsqldb-1.8.0.10.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/log4j-1.2.15.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/protobuf-java-2.3.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.3.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar' -Djava.library.path=:/hadoop/projects/hadoop-1.0.4/libexec/../lib/native/Linux-amd64-64 org.apache.flume.node.Application -n agent1 -f conf/agent1.conf
    12/12/15 02:55:29 INFO node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
    12/12/15 02:55:29 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:conf/agent1.conf
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: agent1
    12/12/15 02:55:29 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration  for agents: [agent1]
    12/12/15 02:55:29 INFO node.AbstractConfigurationProvider: Creating channels
    12/12/15 02:55:29 INFO channel.DefaultChannelFactory: Creating instance of channel MemoryChannel-2 type memory
    12/12/15 02:55:29 INFO node.AbstractConfigurationProvider: Created channel MemoryChannel-2
    12/12/15 02:55:29 INFO source.DefaultSourceFactory: Creating instance of source tail, type exec
    12/12/15 02:55:29 INFO sink.DefaultSinkFactory: Creating instance of sink: HDFS, type: hdfs
    12/12/15 02:55:30 INFO hdfs.HDFSEventSink: Hadoop Security enabled: false
    12/12/15 02:55:30 INFO node.Application: Starting new configuration:{ sourceRunners:{tail=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:tail,state:IDLE} }} sinkRunners:{HDFS=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@137efe53 counterGroup:{ name:null counters:{} } }} channels:{MemoryChannel-2=org.apache.flume.channel.MemoryChannel{name: MemoryChannel-2}} }
    12/12/15 02:55:30 INFO node.Application: Starting Channel MemoryChannel-2
    12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: CHANNEL, name: MemoryChannel-2, registered successfully.
    12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: MemoryChannel-2 started
    12/12/15 02:55:30 INFO node.Application: Starting Sink HDFS
    12/12/15 02:55:30 INFO node.Application: Starting Source tail
    12/12/15 02:55:30 INFO source.ExecSource: Exec source starting with command:tail -F /var/log/apache2/access.log.1
    12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: HDFS, registered successfully.
    12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
    

    See the last 2 lines.

    Second, the agent won't push any data until something new is written to the file, here '/var/log/apache2/access.log'. Either append something to the file manually, or restart Apache and generate some requests, and then check the contents of your /hdfs/flume directory.
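
The second check can be sketched in shell as follows. This is only an illustration: the helper names append_marker and list_sink_dir are hypothetical, the 30-second wait is an arbitrary guess, and the listing step assumes the hadoop command-line client is on the PATH.

```shell
#!/bin/sh
# Append one uniquely tagged line to FILE and print the tag,
# so the same text can be searched for later in the sink output.
append_marker() {
    marker="flume-test-$(date +%s)-$$"
    printf '%s\n' "$marker" >> "$1"
    printf '%s\n' "$marker"
}

# List the sink directory if the hadoop CLI is available; otherwise say so.
list_sink_dir() {
    if command -v hadoop >/dev/null 2>&1; then
        hadoop fs -ls "$1"
    else
        echo "hadoop CLI not found; skipping listing of $1"
    fi
}

# Example use with the paths from this answer
# (writing to the Apache log usually requires root):
#   marker=$(append_marker /var/log/apache2/access.log)
#   sleep 30    # arbitrary wait to give the sink time to flush
#   list_sink_dir /hdfs/flume
```

Searching the newest file in the sink directory for the printed marker then shows whether the line made it through the agent.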

  • 2021-01-07 09:30

    /var/log/apache2/access.log probably does not contain enough data yet for Flume to print any lines from it. Try appending some lines as follows, and you should see the output in the console:

    for i in {1..100}; do echo "tail log test$i" >> /var/log/apache2/access.log; done
    
  • 2021-01-07 09:43

    If you are using Flume 1.7.0 or later, you can use the Taildir source instead.

    The following is what I used in my project:

    a1.sources.r1.type = TAILDIR
    a1.sources.r1.positionFile = /xxx/env/flume/taildir_position.json
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /xxx/logs/file_name.*.log
    a1.sources.r1.headers.f1.headerKey1 = yyy
    a1.sources.r1.fileHeader = true
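
One detail worth checking in the config above: according to the Flume user guide, the file name part of a filegroup path is a regular expression, not a shell glob. The sketch below emulates that with grep, assuming the pattern has to match the whole file name; the helper name matches_filegroup and the sample file names are hypothetical, and grep's extended regex syntax is only an approximation of Java's.

```shell
#!/bin/sh
# File name part of filegroups.f1 from the config above, treated as a regex.
pattern='file_name.*.log'

# Succeed if NAME matches the pattern as a whole (grep -x anchors both ends).
matches_filegroup() {
    printf '%s\n' "$1" | grep -Eqx -- "$pattern"
}

# file_nameXlog also matches, because the unescaped dots match any character.
for name in file_name.2021-01-07.log file_nameXlog other.log; do
    if matches_filegroup "$name"; then
        echo "match: $name"
    else
        echo "no match: $name"
    fi
done
```

If only names ending in a literal ".log" should be picked up, escaping the dots (for example `file_name\..*\.log`) would be the usual way to tighten the pattern.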
    