How to make Flink flush the last line to the sink when the producer (Kafka) does not produce new lines

Submitted on 2019-12-13 03:49:51

Question


When my Flink program runs in event-time mode, the sink does not receive the last line (say, line A). If I feed a new line (line B) to Flink, I then receive line A, but I still can't get line B.

    import java.util.Properties

    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    import scala.util.parsing.json.JSON

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "test")

    val consumer = new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)

    val stream: DataStream[String] = env.addSource(consumer).setParallelism(1)

    stream.map { m =>
      // Parse the JSON envelope and extract the payload; parseMessage and
      // OnlineNum are helpers defined elsewhere in my code.
      val result = JSON.parseFull(m).asInstanceOf[Some[Map[String, Any]]].get
      val msg = result("message").asInstanceOf[String]
      val num = parseMessage(msg)
      val key = s"${num.zoneId} ${num.subZoneId}"
      (key, num, num.onlineNum)
    }.filter { data =>
      data._2.subZoneId == 301 && data._2.zoneId == 5002
    }.assignTimestampsAndWatermarks(new MyTimestampExtractor())
      .keyBy(0)
      .window(TumblingEventTimeWindows.of(Time.seconds(1)))
      .allowedLateness(Time.minutes(1))
      .maxBy(2)
      .addSink { v =>
        System.out.println(s"${v._2.time} ${v._1}: ${v._2.onlineNum} ")
      }
    import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor

    class MyTimestampExtractor() extends AscendingTimestampExtractor[(String, OnlineNum, Int)] {
      // note: "ss" (seconds), not "SS" (fraction-of-second), so times like
      // "01:16:04" parse correctly
      val byMinute = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

      override def extractAscendingTimestamp(element: (String, OnlineNum, Int)): Long = {
        val dateTimeString = element._2.date + " " + element._2.time
        byMinute.parse(dateTimeString).getTime
      }
    }

Data sample:

01:01:14 5002 301: 29 
01:01:36 5002 301: 27 
01:02:05 5002 301: 27 
01:02:31 5002 301: 29 
01:03:02 5002 301: 29 
01:03:50 5002 301: 29 
01:04:52 5002 301: 29 
01:07:24 5002 301: 26 
01:09:28 5002 301: 21 
01:11:04 5002 301: 22 
01:12:11 5002 301: 24 
01:13:54 5002 301: 23 
01:15:13 5002 301: 22 
01:16:04 5002 301: 19 (I cannot get this line)

Then I push a new line to Flink (via Kafka):

01:17:28 5002 301: 15 

I then get 01:16:04 5002 301: 19, but 01:17:28 5002 301: 15 may still be held back inside Flink.


Answer 1:


This happens because the program runs in event time, so the events' own timestamps are what moves time forward for the windows.

In such a case, when the last event sits alone in a window, Flink does not know that the window should be emitted: the watermark only advances when newer events arrive. That is why, when you add the next event, the previous window is closed and its elements are emitted (in your case the 19), while the next window is created and held back in turn (in your case the 15).

Probably the best idea in such a case is to add a custom processing-time-based Trigger, which basically allows you to emit the window after some wall-clock time has passed, whether or not new events are flowing; see the sketch below. You can find more info about Trigger in the documentation.
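As an illustration only (not part of the original answer), here is a minimal sketch of such a trigger, assuming a tumbling event-time TimeWindow like the one in the question. The class name EventTimeOrTimeoutTrigger and the timeoutMs parameter are invented for this sketch; a production version would also keep the processing-time timer's timestamp in state so that clear() could delete it.

    import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow

    // Sketch: fire at the normal event-time boundary, or after `timeoutMs` of
    // wall-clock time has passed since an element arrived, whichever comes first.
    class EventTimeOrTimeoutTrigger[T](timeoutMs: Long) extends Trigger[T, TimeWindow] {

      override def onElement(element: T, timestamp: Long, window: TimeWindow,
                             ctx: Trigger.TriggerContext): TriggerResult = {
        // Normal event-time behaviour: fire when the watermark passes the window end.
        ctx.registerEventTimeTimer(window.maxTimestamp)
        // Safety net: also fire after `timeoutMs` of processing time.
        ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime + timeoutMs)
        TriggerResult.CONTINUE
      }

      override def onEventTime(time: Long, window: TimeWindow,
                               ctx: Trigger.TriggerContext): TriggerResult =
        if (time == window.maxTimestamp) TriggerResult.FIRE else TriggerResult.CONTINUE

      override def onProcessingTime(time: Long, window: TimeWindow,
                                    ctx: Trigger.TriggerContext): TriggerResult =
        TriggerResult.FIRE

      override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit =
        ctx.deleteEventTimeTimer(window.maxTimestamp)
    }

Under those assumptions it would be attached right after the window assigner in the question's pipeline:

    .window(TumblingEventTimeWindows.of(Time.seconds(1)))
      .trigger(new EventTimeOrTimeoutTrigger(60 * 1000L)) // flush at most 60s after arrival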




Answer 2:


What is the final solution, please? I also ran into a similar situation, which can be worked around by emitting new Watermark(System.currentTimeMillis()), but that does not seem to fit the purpose of watermarks. Isn't this a common problem? Or are application developers deliberately ignoring it, and the community ignoring it too?

Why are results not emitted on time when I consume Kafka messages using Flink streaming SQL with GROUP BY TUMBLE(rowtime)?
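To make the new Watermark(System.currentTimeMillis()) workaround mentioned above concrete, here is a rough sketch (again, not from the original answers) using the legacy AssignerWithPeriodicWatermarks API: it tracks event time while data flows, and lets the watermark advance with the wall clock once the source has been idle. IdleAwareWatermarkAssigner, extractTs and idleTimeoutMs are invented names, and note that the idle fallback can mark genuinely late-arriving events as late.

    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
    import org.apache.flink.streaming.api.watermark.Watermark

    // Sketch: normal event-time watermarks while events flow; once the source
    // has been idle for more than `idleTimeoutMs`, fall back to processing
    // time so that pending windows can still fire. getCurrentWatermark() is
    // invoked periodically by Flink, so it keeps advancing even with no input.
    class IdleAwareWatermarkAssigner[T](extractTs: T => Long, idleTimeoutMs: Long)
        extends AssignerWithPeriodicWatermarks[T] {

      private var maxTs = Long.MinValue + 1
      private var lastSeenAt = System.currentTimeMillis()

      override def extractTimestamp(element: T, previousTs: Long): Long = {
        val ts = extractTs(element)
        maxTs = math.max(maxTs, ts)
        lastSeenAt = System.currentTimeMillis()
        ts
      }

      override def getCurrentWatermark(): Watermark =
        if (System.currentTimeMillis() - lastSeenAt > idleTimeoutMs)
          new Watermark(System.currentTimeMillis()) // idle: advance with the wall clock
        else
          new Watermark(maxTs - 1)                  // normal event-time watermark
    }

It would take the place of MyTimestampExtractor in assignTimestampsAndWatermarks(...). For what it's worth, Flink 1.11+ offers WatermarkStrategy#withIdleness for idle sources, though that marks the stream idle for downstream operators rather than advancing the watermark by itself.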



Source: https://stackoverflow.com/questions/55499764/how-to-let-flink-flush-last-line-to-sink-when-producerkafka-does-not-produce-n
