Flink count distinct issue

Submitted by 怎甘沉沦 on 2021-02-11 14:26:39

Question


We currently use a tumbling window to compute a distinct count. The problem is that when we extend the tumbling window from one day to one month, we can no longer get the distinct count as of now: with a one-month window, a result is only emitted once per month, on the 1st. How can I get the distinct count up to the current moment (today is Mar 9)?

package flink.trigger;

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.text.SimpleDateFormat;
import java.util.Date;

public class CustomCountDistinctTigger extends Trigger<Object, TimeWindow> {

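    // Tracks the next intermediate fire time. The reducer must be a ReduceFunction<Long>;
    // keeping the minimum mirrors Flink's built-in ContinuousProcessingTimeTrigger.
    private final ReducingStateDescriptor<Long> timeState =
            new ReducingStateDescriptor<Long>("fire-interval", (a, b) -> Math.min(a, b), LongSerializer.INSTANCE);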
    private long interval;


    public CustomCountDistinctTigger(long interval) {
        this.interval = interval;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        ReducingState<Long> fireTimestamp = ctx.getPartitionedState(timeState);

        // Schedule the first intermediate firing, aligned to a multiple of the interval.
        if (fireTimestamp.get() == null) {
            long now = ctx.getCurrentProcessingTime();
            long start = now - (now % interval);
            long nextFireTimestamp = start + interval;
            ctx.registerProcessingTimeTimer(nextFireTimestamp);
            fireTimestamp.add(nextFireTimestamp);
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        // Debug output: wall-clock time of each timer callback.
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println(df.format(new Date()));

        // End of window: emit the final result and discard the window contents.
        if (window.maxTimestamp() == time) {
            return TriggerResult.FIRE_AND_PURGE;
        }

        // Intermediate timer: emit the current result, keep the contents, schedule the next firing.
        ReducingState<Long> fireTimestamp = ctx.getPartitionedState(timeState);
        Long scheduled = fireTimestamp.get();
        if (scheduled != null && scheduled == time) {
            fireTimestamp.clear();
            fireTimestamp.add(time + interval);
            ctx.registerProcessingTimeTimer(time + interval);
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
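    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        // Remove the pending intermediate timer and its state when the window is disposed.
        ReducingState<Long> fireTimestamp = ctx.getPartitionedState(timeState);
        Long scheduled = fireTimestamp.get();
        if (scheduled != null) {
            ctx.deleteProcessingTimeTimer(scheduled);
            fireTimestamp.clear();
        }
    }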

}


Distinct count job:
DataStreamSink<Tuple2<String, Integer>> finalResultStream = keyedStream
        .flatMap(new KPIDistinctDataFlatMapFunction(inputSchema))
        .map(new SwapMap())
        .keyBy(new WordKeySelector())
        .window(TumblingProcessingTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.minutes(5)))
        // Note: 1 * 60 * 6000 evaluates to 360000 ms (6 minutes), longer than the 5-minute window;
        // a one-minute interval would be 60 * 1000.
        .trigger(new CustomCountDistinctTigger(1 * 60 * 6000))
        .aggregate(new DistinctCountAggregateFunction())
        .print("final print");

Answer 1:


You can define a custom Trigger that returns FIRE once a day to emit intermediate results, and then returns FIRE_AND_PURGE at the end of the month to close the window.
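
To make the "once a day" schedule concrete, here is a minimal sketch of the alignment arithmetic that the trigger in the question already uses (`timestamp - (timestamp % interval)`), applied to a one-day interval. It uses only the Java standard library, not Flink. The class name `FireTimeAlignment`, the constant `ONE_DAY_MS`, and the year 2020 in the sample timestamp are assumptions made for illustration. Because the arithmetic is aligned to the epoch, the daily boundary falls at midnight UTC rather than local midnight.

```java
import java.time.Instant;

// Stdlib-only arithmetic sketch; no Flink classes are involved.
class FireTimeAlignment {
    // Hypothetical constant: one day expressed in milliseconds.
    static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

    // Returns the next multiple of intervalMs strictly after nowMs.
    // Assumes nowMs is a non-negative epoch timestamp in milliseconds.
    static long nextFireTime(long nowMs, long intervalMs) {
        long start = nowMs - (nowMs % intervalMs); // start of the current interval
        return start + intervalMs;                 // boundary of the next interval
    }

    public static void main(String[] args) {
        // Assumed sample time: Mar 9 (year chosen for illustration), 10:30 UTC.
        long now = Instant.parse("2020-03-09T10:30:00Z").toEpochMilli();
        long next = nextFireTime(now, ONE_DAY_MS);
        System.out.println(Instant.ofEpochMilli(next)); // prints 2020-03-10T00:00:00Z
    }
}
```

In a trigger, the returned value would be the timestamp passed to `ctx.registerProcessingTimeTimer(...)`, as the question's `onElement` does.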

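Every time the Trigger returns FIRE, your window is evaluated by calling the process() method of your ProcessWindowFunction, which can emit results through the Collector it is given. (With aggregate(), as in the question, each firing emits the value returned by the AggregateFunction's getResult().) FIRE keeps the window contents, so each later firing still sees every element received so far. FIRE_AND_PURGE evaluates the window one last time and then discards its contents.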
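
The difference between FIRE and FIRE_AND_PURGE can be illustrated with a small model. This is a stdlib-only sketch, not Flink code: the class `WindowFiringModel`, its `Result` enum, and the string ids are hypothetical stand-ins for the window operator, `TriggerResult`, and the stream elements.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stdlib-only model of how a window reacts to trigger results; this is not Flink code.
class WindowFiringModel {
    // Hypothetical stand-in for Flink's TriggerResult values used in this thread.
    enum Result { CONTINUE, FIRE, FIRE_AND_PURGE }

    private final Set<String> distinct = new HashSet<>();    // window contents
    private final List<Integer> emitted = new ArrayList<>(); // distinct counts emitted so far

    // Adds one element to the window; duplicates do not change the distinct set.
    void add(String id) {
        distinct.add(id);
    }

    // Applies a trigger decision to the window.
    void onTimer(Result result) {
        if (result == Result.CONTINUE) {
            return; // nothing is emitted
        }
        emitted.add(distinct.size()); // FIRE and FIRE_AND_PURGE both evaluate the window
        if (result == Result.FIRE_AND_PURGE) {
            distinct.clear(); // only the purge discards the contents
        }
    }

    List<Integer> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        WindowFiringModel window = new WindowFiringModel();
        window.add("a");
        window.add("b");
        window.onTimer(Result.FIRE);           // intermediate result: 2
        window.add("b");
        window.add("c");
        window.onTimer(Result.FIRE);           // intermediate result: 3
        window.onTimer(Result.FIRE_AND_PURGE); // final result: 3, then contents are dropped
        window.add("a");
        window.onTimer(Result.FIRE);           // counting starts over: 1
        System.out.println(window.emitted());  // prints [2, 3, 3, 1]
    }
}
```

Each intermediate FIRE reports the running distinct count, which is the "as of now" number the question asks for; the count only resets after the purge.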

See also the answers to the related question "How to display intermediate results in a windowed streaming-etl?".



Source: https://stackoverflow.com/questions/60599700/flink-count-distinct-issue
