Question
When my Flink program runs in event-time mode, the sink does not receive the last line (say, line A). If I feed a new line (line B) to Flink, I get line A, but I still can't get line B.
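import java.util.Properties

import scala.util.parsing.json.JSON

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")

val consumer = new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)
val stream: DataStream[String] = env.addSource(consumer).setParallelism(1)

// parseMessage and OnlineNum are defined elsewhere (not shown).
stream
    .map { m =>
        val result = JSON.parseFull(m).asInstanceOf[Some[Map[String, Any]]].get
        val msg = result("message").asInstanceOf[String]
        val num = parseMessage(msg)
        val key = s"${num.zoneId} ${num.subZoneId}"
        (key, num, num.onlineNum)
    }
    .filter { data => data._2.subZoneId == 301 && data._2.zoneId == 5002 }
    .assignTimestampsAndWatermarks(new MyTimestampExtractor())
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.seconds(1)))
    .allowedLateness(Time.minutes(1))
    .maxBy(2)
    .addSink { v =>
        System.out.println(s"${v._2.time} ${v._1}: ${v._2.onlineNum} ")
    }
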
class MyTimestampExtractor extends AscendingTimestampExtractor[(String, OnlineNum, Int)] {
    // Lowercase "ss" is seconds; uppercase "SS" would be parsed as milliseconds.
    val byMinute = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    override def extractAscendingTimestamp(element: (String, OnlineNum, Int)): Long = {
        val dateTimeString = element._2.date + " " + element._2.time
        val c1 = byMinute.parse(dateTimeString).getTime
        if (element._2.time.contains("22:59") && element._2.subZoneId == 301) {
            // System.out.println(s"${element._2.time} ${element._1}: ${element._2.onlineNum} ")
            // System.out.println(s"${element._2.time} ${c1 - getCurrentWatermark.getTimestamp}")
        }
        // System.out.println(s"${element._2.time} ${c1} ${c1 - getCurrentWatermark.getTimestamp}")
        c1
    }
}

Data sample:
01:01:14 5002 301: 29
01:01:36 5002 301: 27
01:02:05 5002 301: 27
01:02:31 5002 301: 29
01:03:02 5002 301: 29
01:03:50 5002 301: 29
01:04:52 5002 301: 29
01:07:24 5002 301: 26
01:09:28 5002 301: 21
01:11:04 5002 301: 22
01:12:11 5002 301: 24
01:13:54 5002 301: 23
01:15:13 5002 301: 22
01:16:04 5002 301: 19 (I cannot get this line)
Then I push a new line to Flink (via Kafka):
01:17:28 5002 301: 15
Now I get 01:16:04 5002 301: 19, but 01:17:28 5002 301: 15 appears to be held in Flink.
Answer 1:
This happens because the job runs in event time, so the events' own timestamps are used to measure the flow of time for windows.
In that case, when only one event is in the window, Flink does not know that the window should be emitted: the watermark only advances when a newer event arrives. For this reason, when you add the next event, the previous window is closed and its elements are emitted (in your case 19), but a new window is then created for the new event (in your case 15), and that window waits in turn.
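To make the timing concrete, here is a small plain-Scala sketch (no Flink dependency) of the arithmetic involved. It assumes what the question's code implies: 1-second tumbling windows, and a watermark equal to the latest timestamp minus 1 ms, which is what AscendingTimestampExtractor emits. The object and method names are made up for illustration, and the timestamps are milliseconds since midnight rather than real epoch values.

```scala
object WatermarkDemo {
  val windowSizeMs = 1000L

  // Exclusive end of the tumbling window containing ts (non-negative ts, offset 0).
  def windowEnd(ts: Long): Long = ts - (ts % windowSizeMs) + windowSizeMs

  // Watermark after an element with timestamp ts has been seen (timestamp - 1).
  def watermarkAfter(ts: Long): Long = ts - 1

  // An event-time window fires once the watermark reaches its last
  // timestamp, i.e. windowEnd - 1.
  def firesWindowOf(eventTs: Long, watermark: Long): Boolean =
    watermark >= windowEnd(eventTs) - 1

  def millisOfDay(h: Int, m: Int, s: Int): Long = (h * 3600L + m * 60L + s) * 1000L

  def main(args: Array[String]): Unit = {
    val a = millisOfDay(1, 16, 4)   // 01:16:04, the "stuck" line
    val b = millisOfDay(1, 17, 28)  // 01:17:28, the line pushed afterwards

    println(firesWindowOf(a, watermarkAfter(a))) // false: A alone cannot close its own window
    println(firesWindowOf(a, watermarkAfter(b))) // true:  B's arrival pushes the watermark past A's window
    println(firesWindowOf(b, watermarkAfter(b))) // false: now B is the one waiting
  }
}
```

The last event of a stream is always in this position: its window can only close once something moves the watermark past the window's end.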
Probably the best idea in such a case is to add a custom trigger based on processing time (along the lines of ProcessingTimeTrigger),
which will let you emit the window after some time has passed, whether or not events are still flowing. You can find information about Trigger in the documentation.
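A minimal sketch of such a trigger follows, assuming the pre-1.11 style DataStream API used in the question. The class name EventTimeOrTimeoutTrigger and the timeoutMs parameter are hypothetical, not part of Flink. It fires when the watermark passes the end of the window (as the default event-time trigger does) or when a processing-time timeout expires after the window's first element, whichever comes first.

```scala
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Hypothetical trigger: event-time firing plus a processing-time timeout.
class EventTimeOrTimeoutTrigger[T](timeoutMs: Long) extends Trigger[T, TimeWindow] {

  // Remembers the pending processing-time timer for this key and window.
  private val timeoutTimer =
    new ValueStateDescriptor[java.lang.Long]("timeout-timer", classOf[java.lang.Long])

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    if (window.maxTimestamp() <= ctx.getCurrentWatermark) {
      // The watermark is already past the window: fire immediately.
      TriggerResult.FIRE
    } else {
      ctx.registerEventTimeTimer(window.maxTimestamp())
      val state = ctx.getPartitionedState(timeoutTimer)
      if (state.value() == null) {
        // Register only one timeout per window, not one per element.
        val fireAt = ctx.getCurrentProcessingTime + timeoutMs
        ctx.registerProcessingTimeTimer(fireAt)
        state.update(fireAt)
      }
      TriggerResult.CONTINUE
    }
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.FIRE else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult = {
    // Timeout reached without the watermark closing the window: emit what we have.
    ctx.getPartitionedState(timeoutTimer).clear()
    TriggerResult.FIRE
  }

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.deleteEventTimeTimer(window.maxTimestamp())
    val state = ctx.getPartitionedState(timeoutTimer)
    val fireAt = state.value()
    if (fireAt != null) ctx.deleteProcessingTimeTimer(fireAt)
    state.clear()
  }
}
```

In the question's pipeline this would be attached with something like .trigger(new EventTimeOrTimeoutTrigger[(String, OnlineNum, Int)](5000L)) right after .window(...). Two caveats: because the trigger uses FIRE rather than FIRE_AND_PURGE, a window that fired on timeout fires again when the watermark eventually passes it, so the sink may see the same result twice; and window state is still only cleaned up once the watermark advances.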
Answer 2:
What is the final solution, please? I ran into a similar situation. It can be solved by emitting new Watermark(System.currentTimeMillis()), but that does not seem to fit the purpose of a watermark. Isn't this a common problem, or are application developers and the community deliberately ignoring it?
Why not on-time when I consumed kafka message using flink streaming sql group by TUMBLE(rowtime)?
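For reference, a gentler variant of the wall-clock watermark idea can be sketched as a periodic assigner that behaves like AscendingTimestampExtractor while data is flowing and only starts advancing the watermark from wall-clock time once the stream has been silent for a while. This is an illustration under assumptions, not an established recipe: IdleAdvancingAssigner, idleTimeoutMs and timestampOf are hypothetical names, and AssignerWithPeriodicWatermarks is the older interface that matches the API generation used in the question.

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Hypothetical assigner: event-time watermarks while data flows,
// wall-clock-driven progress after idleTimeoutMs of silence.
abstract class IdleAdvancingAssigner[T](idleTimeoutMs: Long)
    extends AssignerWithPeriodicWatermarks[T] {

  private var maxTimestamp = Long.MinValue // largest event timestamp seen
  private var lastElementAt = 0L           // wall-clock time of the last element
  private var lastEmitted = Long.MinValue  // keeps the watermark monotonic

  // Supplied by the user: event timestamp in epoch milliseconds.
  def timestampOf(element: T): Long

  override def extractTimestamp(element: T, previousElementTimestamp: Long): Long = {
    val ts = timestampOf(element)
    maxTimestamp = math.max(maxTimestamp, ts)
    lastElementAt = System.currentTimeMillis()
    ts
  }

  // Called periodically by Flink (see the auto-watermark interval setting).
  override def getCurrentWatermark(): Watermark =
    if (maxTimestamp == Long.MinValue) {
      null // nothing seen yet, so there is no watermark to emit
    } else {
      val idleFor = System.currentTimeMillis() - lastElementAt
      val candidate =
        if (idleFor > idleTimeoutMs) maxTimestamp + idleFor // stream is idle: move event time forward
        else maxTimestamp - 1                               // normal case: ascending timestamps
      lastEmitted = math.max(lastEmitted, candidate)
      new Watermark(lastEmitted)
    }
}
```

The trade-off is the one the answer hints at: once the watermark has been pushed forward by the wall clock, an element that arrives afterwards with an older timestamp is treated as late, and is only processed if it falls within the window's allowed lateness.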
Source: https://stackoverflow.com/questions/55499764/how-to-let-flink-flush-last-line-to-sink-when-producerkafka-does-not-produce-n