Apache Flink 0.10: how to get the first occurrence of a composite key from an unbounded input DataStream?


I am a newbie with Apache Flink. I have an unbounded data stream as my input (fed into Flink 0.10 via Kafka).

I want to get the first occurrence of each primary key (the composite key).

2 Answers
  • 2021-01-01 01:39

    Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space, because you have to store every key you have already seen somewhere in order to filter out the duplicates. Thus, it is a good idea to define a time window after which the current set of seen keys can be purged (see the tumbling-window update at the end of this answer).

    If you're aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink's state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink's fault tolerance mechanism because your state will be automatically checkpointed.

    A Flink program doing this job could look like:

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    
        DataStream<Tuple3<String, Date, String>> input = env.fromElements(Tuple3.of("foo", new Date(1000), "bar"), Tuple3.of("foo", new Date(1000), "foobar"));
    
        // key by the composite key (fields 0 and 1) and filter out duplicates per key
        input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();
    
        env.execute("Test");
    }
    

    where the implementation of DuplicateFilter depends on the version of Flink.
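
    With the example input above, only the first tuple, ("foo", Date(1000), "bar"), is emitted; the second tuple carries the same (f0, f1) key and is therefore filtered out.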

    Version >= 1.0 implementation

    public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
    
        static final ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class, false);
        private ValueState<Boolean> operatorState;
    
        @Override
        public void open(Configuration configuration) {
            operatorState = this.getRuntimeContext().getState(descriptor);
        }
    
        @Override
        public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
            if (!operatorState.value()) {
                // we haven't seen the element yet
                out.collect(value);
                // set operator state to true so that we don't emit elements with this key again
                operatorState.update(true);
            }
        }
    }
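
    For reference, these are the imports the snippet would need (assuming Flink 1.x package paths):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;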
    

    Version 0.10 implementation

    public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {
    
        private OperatorState<Boolean> operatorState;
    
        @Override
        public void open(Configuration configuration) {
            operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
        }
    
        @Override
        public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
            if (!operatorState.value()) {
                // we haven't seen the element yet
                out.collect(value);
                // remember this key so that subsequent elements with it are dropped
                operatorState.update(true);
            }
        }
    }
    

    Update: Using a tumbling time window

    input.keyBy(0, 1).timeWindow(Time.seconds(1)).apply(new WindowFunction<Iterable<Tuple3<String, Date, String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
        @Override
        public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
            // emit only the first element of each (key, window) pair
            out.collect(input.iterator().next());
        }
    });
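
    Note that this variant trades latency for bounded state: an element is only emitted once its window fires, but the window state is automatically purged when the window expires.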
    
  • 2021-01-01 01:44

    Here's another way to do this that I happen to have just written. It has the disadvantage of requiring a bit more custom code, since it doesn't use the built-in Flink windowing functions, but it doesn't have the latency penalty that Till mentioned. Full example on GitHub.

    package com.dataartisans.filters;
    
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedAsynchronously;
    
    import java.io.Serializable;
    import java.util.HashSet;
    import java.util.concurrent.TimeUnit;
    
    
    /**
     * This class filters duplicates that occur within a configurable time of each other in a data stream.
     */
    public class DedupeFilterFunction<T, K extends Serializable> extends RichFilterFunction<T> implements CheckpointedAsynchronously<HashSet<K>> {
    
      private LoadingCache<K, Boolean> dedupeCache;
      private final KeySelector<T, K> keySelector;
      private final long cacheExpirationTimeMs;
    
      /**
       * @param keySelector Extracts the deduplication key from each element
       * @param cacheExpirationTimeMs The expiration time for elements in the cache
       */
      public DedupeFilterFunction(KeySelector<T, K> keySelector, long cacheExpirationTimeMs){
        this.keySelector = keySelector;
        this.cacheExpirationTimeMs = cacheExpirationTimeMs;
      }
    
      @Override
      public void open(Configuration parameters) throws Exception {
        createDedupeCache();
      }
    
    
      @Override
      public boolean filter(T value) throws Exception {
        K key = keySelector.getKey(value);
        // the cache's loader returns false for keys that haven't been seen yet
        boolean seen = dedupeCache.get(key);
        if (!seen) {
          dedupeCache.put(key, true);
          return true;
        } else {
          return false;
        }
      }
    
      @Override
      public HashSet<K> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        // checkpoint only the set of keys currently in the cache
        return new HashSet<>(dedupeCache.asMap().keySet());
      }
    
      @Override
      public void restoreState(HashSet<K> state) throws Exception {
        // expiration timestamps are not part of the checkpoint, so restored keys get a fresh expiration
        createDedupeCache();
        for (K key : state) {
          dedupeCache.put(key, true);
        }
      }
    
      private void createDedupeCache() {
        dedupeCache = CacheBuilder.newBuilder()
          .expireAfterWrite(cacheExpirationTimeMs, TimeUnit.MILLISECONDS)
          .build(new CacheLoader<K, Boolean>() {
            @Override
            public Boolean load(K k) throws Exception {
              return false;
            }
          });
      }
    }
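
    A hypothetical usage, deduplicating the Tuple3 stream from the first answer by its (f0, f1) composite key. The stream and field names here are assumptions for illustration; also note that with parallelism > 1 the stream should be keyed by the same key first, so that duplicates land on the same subtask:

    // assumes the DataStream<Tuple3<String, Date, String>> "input" from the first answer,
    // plus an import of org.apache.flink.api.java.tuple.Tuple2
    DataStream<Tuple3<String, Date, String>> deduped = input.filter(
        new DedupeFilterFunction<Tuple3<String, Date, String>, Tuple2<String, Date>>(
            new KeySelector<Tuple3<String, Date, String>, Tuple2<String, Date>>() {
              @Override
              public Tuple2<String, Date> getKey(Tuple3<String, Date, String> value) {
                return new Tuple2<>(value.f0, value.f1);
              }
            },
            TimeUnit.MINUTES.toMillis(1))); // purge seen keys after one minute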
    