Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information

轮回少年 2021-02-06 16:04

I am trying to use Flume-ng to grab 90 seconds of log information and put it into a file in HDFS. I have Flume working to look at the log file via an exec source and tail, however it …

2 Answers
  • 2021-02-06 16:38

    According to the source code of org.apache.flume.sink.hdfs.BucketWriter:

    /**
     * Internal API intended for HDFSSink use.
     * This class does file rolling and handles file formats and serialization.
     * Only the public methods in this class are thread safe.
     */
    class BucketWriter {
      ...
      /**
       * open() is called by append()
       * @throws IOException
       * @throws InterruptedException
       */
      private void open() throws IOException, InterruptedException {
        ...
        // if time-based rolling is enabled, schedule the roll
        if (rollInterval > 0) {
          Callable<Void> action = new Callable<Void>() {
            public Void call() throws Exception {
              LOG.debug("Rolling file ({}): Roll scheduled after {} sec elapsed.",
                  bucketPath, rollInterval);
              try {
                // Roll the file and remove reference from sfWriters map.
                close(true);
              } catch(Throwable t) {
                LOG.error("Unexpected error", t);
              }
              return null;
            }
          };
          timedRollFuture = timedRollerPool.schedule(action, rollInterval,
              TimeUnit.SECONDS);
        }
        ...
      }
      ...
      /**
       * check if time to rotate the file
       */
      private boolean shouldRotate() {
        boolean doRotate = false;
    
        if (writer.isUnderReplicated()) {
          this.isUnderReplicated = true;
          doRotate = true;
        } else {
          this.isUnderReplicated = false;
        }
    
        if ((rollCount > 0) && (rollCount <= eventCounter)) {
          LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
          doRotate = true;
        }
    
        if ((rollSize > 0) && (rollSize <= processSize)) {
          LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
          doRotate = true;
        }
    
        return doRotate;
      }
    ...
    }
    

    and org.apache.flume.sink.hdfs.AbstractHDFSWriter

    public abstract class AbstractHDFSWriter implements HDFSWriter {
    ...
      @Override
      public boolean isUnderReplicated() {
        try {
          int numBlocks = getNumCurrentReplicas();
          if (numBlocks == -1) {
            return false;
          }
          int desiredBlocks;
          if (configuredMinReplicas != null) {
            desiredBlocks = configuredMinReplicas;
          } else {
            desiredBlocks = getFsDesiredReplication();
          }
          return numBlocks < desiredBlocks;
        } catch (IllegalAccessException e) {
          logger.error("Unexpected error while checking replication factor", e);
        } catch (InvocationTargetException e) {
          logger.error("Unexpected error while checking replication factor", e);
        } catch (IllegalArgumentException e) {
          logger.error("Unexpected error while checking replication factor", e);
        }
        return false;
      }
    ...
    }
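
    The configuredMinReplicas checked above is read from the sink's hdfs.minBlockReplicas setting, so when under-replication keeps forcing early rolls, a common workaround is to pin it to 1 (a minimal sketch; agent1 and sink1 are example component names):

    # Treat one replica as sufficient, so a temporarily under-replicated
    # block does not force an early roll
    agent1.sinks.sink1.hdfs.minBlockReplicas = 1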
    

    The rolling of HDFS files is controlled by four conditions:

    1. hdfs.rollSize
    2. hdfs.rollCount
    3. hdfs.minBlockReplicas (highest priority, but usually not the cause of small rolled files)
    4. hdfs.rollInterval

    Change the values according to these if blocks in BucketWriter.
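
    For the 90-second window from the question, a minimal sketch of the relevant sink settings (parameter names as in the Flume HDFS sink configuration; agent1 and sink1 are example component names):

    # Roll purely on time, every 90 seconds
    agent1.sinks.sink1.hdfs.rollInterval = 90
    # Disable size-based and count-based rolling
    agent1.sinks.sink1.hdfs.rollSize = 0
    agent1.sinks.sink1.hdfs.rollCount = 0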

  • 2021-02-06 16:42

    A rewrite of the config file specifying a more complete selection of parameters did the trick. This example rolls the file after 10,000 records or 10 minutes, whichever comes first. In addition, I changed from a memory channel to a file channel to improve the reliability of the data flow.

    agent1.sources = source1
    agent1.sinks = sink1
    agent1.channels = channel1
    
    # Describe/configure source1                                                                                                                                                                                                                 
    agent1.sources.source1.type = exec
    agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune_log.log
    
    # Describe sink1                                                                                                                                                                                                                             
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://localhost/flume/logtest/
    agent1.sinks.sink1.hdfs.filePrefix = LogCreateTest
    # Number of seconds to wait before rolling current file (0 = never roll based on time interval)                                                                                                                                              
    agent1.sinks.sink1.hdfs.rollInterval = 600
    # File size to trigger roll, in bytes (0: never roll based on file size)                                                                                                                                                                     
    agent1.sinks.sink1.hdfs.rollSize = 0
    # Number of events written to file before it is rolled (0 = never roll based on number of events)
    agent1.sinks.sink1.hdfs.rollCount = 10000
    # Number of events written to file before it is flushed to HDFS
    agent1.sinks.sink1.hdfs.batchSize = 10000
    agent1.sinks.sink1.hdfs.txnEventMax = 40000
    # -- Compression codec. one of following : gzip, bzip2, lzo, snappy                                                                                                                                                                          
    # hdfs.codeC = gzip                                                                                                                                                                                                                          
    #format: currently SequenceFile, DataStream or CompressedStream                                                                                                                                                                              
    #(1) DataStream does not compress the output file; do not set codeC
    #(2) CompressedStream requires hdfs.codeC to be set to an available codec
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.maxOpenFiles=50
    # -- "Text" or "Writable"                                                                                                                                                                                                                    
    #hdfs.writeFormat                                                                                                                                                                                                                            
    agent1.sinks.sink1.hdfs.appendTimeout = 10000
    agent1.sinks.sink1.hdfs.callTimeout = 10000
    # Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)                                                                                                                                                                        
    agent1.sinks.sink1.hdfs.threadsPoolSize=100
    # Number of threads per HDFS sink for scheduling timed file rolling                                                                                                                                                                          
    agent1.sinks.sink1.hdfs.rollTimerPoolSize = 1
    # hdfs.kerberosPrincipal Kerberos user principal for accessing secure HDFS
    # hdfs.kerberosKeytab Kerberos keytab for accessing secure HDFS
    # hdfs.round false Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)                                                                                                                         
    # hdfs.roundValue 1 Rounded down to the highest multiple of this (in the unit configured using
    # hdfs.roundUnit), less than current time.                                                                                                                                                                                                   
    # hdfs.roundUnit second The unit of the round down value - second, minute or hour.                                                                                                                                                           
    # serializer TEXT Other possible options include AVRO_EVENT or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.                                                                                 
    # serializer.*                                                                                                                                                                                                                               
    
    
    # Use a channel which buffers events to a file                                                                                                                                                                                               
    # -- The component type name, needs to be FILE.                                                                                                                                                                                              
    agent1.channels.channel1.type = FILE
    # checkpointDir ~/.flume/file-channel/checkpoint The directory where checkpoint file will be stored                                                                                                                                          
    # dataDirs ~/.flume/file-channel/data The directory where log files will be stored                                                                                                                                                           
    # The maximum size of transaction supported by the channel                                                                                                                                                                                   
    agent1.channels.channel1.transactionCapacity = 1000000
    # Amount of time (in millis) between checkpoints                                                                                                                                                                                             
    agent1.channels.channel1.checkpointInterval = 30000
    # Max size (in bytes) of a single log file                                                                                                                                                                                                   
    agent1.channels.channel1.maxFileSize = 2146435071
    # Maximum capacity of the channel                                                                                                                                                                                                            
    agent1.channels.channel1.capacity = 10000000
    #keep-alive 3 Amount of time (in sec) to wait for a put operation                                                                                                                                                                            
    #write-timeout 3 Amount of time (in sec) to wait for a write operation                                                                                                                                                                       
    
    # Bind the source and sink to the channel                                                                                                                                                                                                    
    agent1.sources.source1.channels = channel1
    agent1.sinks.sink1.channel = channel1
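
    With this saved to a configuration file (agent1.conf below is only a placeholder name), the agent is started with the standard launcher, using the same agent name as above:

    flume-ng agent --conf conf --conf-file agent1.conf --name agent1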
    