Filtering log files in Flume using interceptors

醉话见心 2020-12-30 17:17

I have an HTTP server writing log files, which I then load into HDFS using Flume. First, I want to filter the data according to information I have in the event header or body. I read that I can do this with an interceptor, but I'm not sure whether that means writing my own Java code.

2 Answers
  • 2020-12-30 17:29

    You don't need to write Java code to filter events. Use the Regex Filtering Interceptor to filter out events whose body text matches a regular expression:

    agent.sources.logs_source.interceptors = regex_filter_interceptor
    agent.sources.logs_source.interceptors.regex_filter_interceptor.type = regex_filter
    agent.sources.logs_source.interceptors.regex_filter_interceptor.regex = <your regex>
    agent.sources.logs_source.interceptors.regex_filter_interceptor.excludeEvents = true
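
    For example, a minimal sketch (the DEBUG pattern is only an illustration, not from the question) that drops every event whose body contains "DEBUG"; with excludeEvents = false the interceptor would instead keep only the matching events:

    agent.sources.logs_source.interceptors = regex_filter_interceptor
    agent.sources.logs_source.interceptors.regex_filter_interceptor.type = regex_filter
    # any event whose body matches this pattern is dropped
    agent.sources.logs_source.interceptors.regex_filter_interceptor.regex = .*DEBUG.*
    agent.sources.logs_source.interceptors.regex_filter_interceptor.excludeEvents = true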
    

    To route events based on their headers, use the Multiplexing Channel Selector:

    a1.sources = r1
    a1.channels = c1 c2 c3 c4
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = state
    a1.sources.r1.selector.mapping.CZ = c1
    a1.sources.r1.selector.mapping.US = c2 c3
    a1.sources.r1.selector.default = c4
    

    Here events with header "state"="CZ" go to channel "c1", events with "state"="US" go to "c2" and "c3", and all others go to "c4".

    This way you can also filter events by header: route a specific header value to a channel that is drained by a Null Sink, as in the sketch below.
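
    A minimal sketch of that idea (the channel name c_drop and sink name null_sink are made up for illustration); events that don't match any mapping fall into c_drop and are silently discarded:

    # unmatched events go to c_drop instead of a real destination
    a1.channels = c1 c_drop
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = state
    a1.sources.r1.selector.mapping.CZ = c1
    a1.sources.r1.selector.default = c_drop

    # a Null Sink discards every event it takes from its channel
    # (c1 still needs its own sink, e.g. HDFS, omitted here)
    a1.sinks = null_sink
    a1.sinks.null_sink.type = null
    a1.sinks.null_sink.channel = c_drop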

  • 2020-12-30 17:34

    You can use Flume channel selectors to simply route events to different destinations, or you can chain several Flume agents together to implement more complex routing. Chained agents, however, become a little hard to maintain (resource usage and Flume topology). You can also have a look at the flume-ng router sink; it may provide some of the functionality you want.

    First, add a specific field to the event header with a Flume interceptor:

    a1.sources = r1 r2
    a1.channels = c1 c2
    a1.sources.r1.channels =  c1
    a1.sources.r1.type = seq
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = static
    a1.sources.r1.interceptors.i1.key = datacenter
    a1.sources.r1.interceptors.i1.value = NEW_YORK
    a1.sources.r2.channels =  c2
    a1.sources.r2.type = seq
    a1.sources.r2.interceptors = i2
    a1.sources.r2.interceptors.i2.type = static
    a1.sources.r2.interceptors.i2.key = datacenter
    a1.sources.r2.interceptors.i2.value = BERKELEY
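
    If a1 and a2 run as separate agents, the tagged events typically travel between them over Avro before the selector is applied. A minimal sketch of that hop (the host name a2_host and port 34545 are assumptions):

    # a1: forward events from both channels to agent a2 over Avro
    a1.sinks = k1 k2
    a1.sinks.k1.type = avro
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hostname = a2_host
    a1.sinks.k1.port = 34545
    a1.sinks.k2.type = avro
    a1.sinks.k2.channel = c2
    a1.sinks.k2.hostname = a2_host
    a1.sinks.k2.port = 34545

    # a2: receive them on an Avro source; the multiplexing selector below routes them
    a2.sources.r2.type = avro
    a2.sources.r2.bind = 0.0.0.0
    a2.sources.r2.port = 34545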
    

    Then you can set up your Flume channel selector like this:

    a2.sources = r2
    a2.channels = c1 c2 c3 c4
    a2.sources.r2.channels = c1 c2 c3 c4
    a2.sources.r2.selector.type = multiplexing
    a2.sources.r2.selector.header = datacenter
    a2.sources.r2.selector.mapping.NEW_YORK = c1
    a2.sources.r2.selector.mapping.BERKELEY = c2 c3
    a2.sources.r2.selector.default = c4
    

    Or you can set up the avro-router sink like this:

    agent.sinks.routerSink.type = com.datums.stream.AvroRouterSink
    agent.sinks.routerSink.hostname = test_host
    agent.sinks.routerSink.port = 34541
    agent.sinks.routerSink.channel = memoryChannel
    
    # Set sink name
    agent.sinks.routerSink.component.name = AvroRouterSink
    
    # Set header name for routing
    agent.sinks.routerSink.condition = datacenter
    
    # Set routing conditions
    agent.sinks.routerSink.conditions = east,west
    agent.sinks.routerSink.conditions.east.if = ^NEW_YORK
    agent.sinks.routerSink.conditions.east.then.hostname = east_host
    agent.sinks.routerSink.conditions.east.then.port = 34542
    agent.sinks.routerSink.conditions.west.if = ^BERKELEY
    agent.sinks.routerSink.conditions.west.then.hostname = west_host
    agent.sinks.routerSink.conditions.west.then.port = 34543
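
    Note that AvroRouterSink is a third-party sink, so its jar has to be on the Flume agent's classpath (for example via the plugins.d directory) before this configuration will load.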
    