Question
We currently use a tumbling window to compute a distinct count. The problem is that when we extend the tumbling window from a day to a month, we can no longer get the distinct count as of now. With a 1-month tumbling window, a result is only emitted on the 1st of each month. How can I get the current distinct count as of now (today is Mar 9)?
package flink.trigger;

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.text.SimpleDateFormat;
import java.util.Date;

public class CustomCountDistinctTigger extends Trigger<Object, TimeWindow> {

    // Per-key, per-window state holding the timestamp of the next intermediate firing.
    private final ReducingStateDescriptor<Long> timeState =
            new ReducingStateDescriptor<>("fire-interval", new DistinctCountAggregateFunction(), LongSerializer.INSTANCE);

    // Interval between intermediate firings, in milliseconds.
    private long interval;

    public CustomCountDistinctTigger(long interval) {
        this.interval = interval;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        ReducingState<Long> fireTimestamp = ctx.getPartitionedState(timeState);
        // Ignore the element timestamp and use the current processing time instead.
        timestamp = ctx.getCurrentProcessingTime();
        if (fireTimestamp.get() == null) {
            // First element for this window: align the first timer to the next multiple of interval.
            long start = timestamp - (timestamp % interval);
            long nextFireTimestamp = start + interval;
            ctx.registerProcessingTimeTimer(nextFireTimestamp);
            fireTimestamp.add(nextFireTimestamp);
            return TriggerResult.CONTINUE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        // System.out.println("onProcessingTime called at "+System.currentTimeMillis() );
        // return TriggerResult.FIRE_AND_PURGE;
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println(df.format(new Date()));
        // interval
        ReducingState<Long> fireTimestamp = ctx.getPartitionedState(timeState);
        if (window.maxTimestamp() == time) {
            // End of the window: emit the final result and drop the window contents.
            return TriggerResult.FIRE_AND_PURGE;
        } else if (fireTimestamp.get().equals(time)) {
            // Intermediate firing: emit the current result and schedule the next timer.
            fireTimestamp.clear();
            fireTimestamp.add(time + interval);
            ctx.registerProcessingTimeTimer(time + interval);
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        // No cleanup is performed here; the timer and its state are left as-is.
    }
}
Distinct count:

DataStreamSink<Tuple2<String, Integer>> finalResultStream = keyedStream
        .flatMap(new KPIDistinctDataFlatMapFunction(inputSchema))
        .map(new SwapMap())
        .keyBy(new WordKeySelector())
        .window(TumblingProcessingTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.minutes(5)))
        .trigger(new CustomCountDistinctTigger(1 * 60 * 6000))
        .aggregate(new DistinctCountAggregateFunction())
        .print("final print");
Answer 1:
You can define a custom Trigger that returns FIRE once a day to trigger intermediate results, and then does a FIRE_AND_PURGE at the end of the month to close the window.
Every time the Trigger returns FIRE, your window is evaluated by calling the process() method of your ProcessWindowFunction, at which point it can produce results with the Collector that is provided. FIRE_AND_PURGE evaluates the window one last time, and then destroys it.
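To make the firing schedule concrete, here is a minimal, Flink-free sketch of the decision logic described above. It is only a model, not Flink code: PeriodicFireModel, Decision, nextFireTime, onTimer and simulate are hypothetical names made up for illustration, and Decision merely mirrors the TriggerResult values used in the question. It assumes timers are aligned to multiples of the interval (as in the question's onElement) and that the window's last timestamp is its end minus 1 ms.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a "fire periodically, purge at window end" trigger.
// All names here are hypothetical; nothing in this class is a Flink API.
final class PeriodicFireModel {

    // Mirrors the TriggerResult values that matter for this discussion.
    enum Decision { CONTINUE, FIRE, FIRE_AND_PURGE }

    // First multiple of intervalMs strictly after nowMs (same arithmetic as the question's onElement).
    static long nextFireTime(long nowMs, long intervalMs) {
        long start = nowMs - (nowMs % intervalMs);
        return start + intervalMs;
    }

    // Decision taken when a timer goes off at timerMs.
    static Decision onTimer(long timerMs, long expectedFireMs, long windowMaxTimestampMs) {
        if (timerMs == windowMaxTimestampMs) {
            return Decision.FIRE_AND_PURGE; // final result, then the window is discarded
        }
        if (timerMs == expectedFireMs) {
            return Decision.FIRE;           // intermediate result, window contents kept
        }
        return Decision.CONTINUE;
    }

    // Lists "timestamp:decision" for every timer of the window [windowStartMs, windowEndMs).
    static List<String> simulate(long windowStartMs, long windowEndMs, long intervalMs, long firstElementMs) {
        long maxTs = windowEndMs - 1; // assumed last timestamp that belongs to the window
        List<String> log = new ArrayList<>();
        long next = nextFireTime(firstElementMs, intervalMs);
        while (next < maxTs) {
            log.add(next + ":" + onTimer(next, next, maxTs));
            next += intervalMs;
        }
        log.add(maxTs + ":" + onTimer(maxTs, next, maxTs));
        return log;
    }

    public static void main(String[] args) {
        // 5-minute window, 1-minute interval, first element arrives 30 s into the window.
        for (String line : simulate(0L, 300_000L, 60_000L, 30_000L)) {
            System.out.println(line);
        }
    }
}
```

For a monthly window you would use a one-day interval in the same way. Note that the interval passed in the question, 1 * 60 * 6000, is 360000 ms (6 minutes), which is longer than the 5-minute window, so under this model a window gets at most one intermediate firing; 60 * 1000 may have been what was intended.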
See also the answers to this question -- How to display intermediate results in a windowed streaming-etl? -- which covered a related topic.
Source: https://stackoverflow.com/questions/60599700/flink-count-distinct-issue