apache-flink

Can't understand the result for the event time group by window

风格不统一 submitted on 2021-02-11 14:41:24
Question: I am using Flink 1.12.0, and I have a data collection that I use to try out the event-time group window. Following is the full code.

package org.example.sqlexploration

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.table
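A minimal sketch of the kind of event-time group window being tried out, written here in the Java Table API rather than the Scala of the question, with made-up fields item, price and ts (epoch millis); the watermark strategy and the 10-second tumble interval are assumptions:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import static org.apache.flink.table.api.Expressions.$;

public class EventTimeGroupWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Made-up input: (item, price, event time in epoch millis).
        DataStream<Tuple3<String, Double, Long>> source = env
                .fromElements(
                        Tuple3.of("a", 1.0, 1_000L),
                        Tuple3.of("a", 2.0, 4_000L),
                        Tuple3.of("b", 3.0, 11_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                                .withTimestampAssigner((event, previous) -> event.f2));

        // Register the stream as a table with ts as a rowtime attribute,
        // then aggregate over a 10-second tumbling event-time window.
        tEnv.createTemporaryView("items", source, $("item"), $("price"), $("ts").rowtime());

        Table result = tEnv.sqlQuery(
                "SELECT item, TUMBLE_END(ts, INTERVAL '10' SECOND) AS w_end, SUM(price) AS total "
                        + "FROM items GROUP BY item, TUMBLE(ts, INTERVAL '10' SECOND)");

        tEnv.toAppendStream(result, Row.class).print();
        env.execute();
    }
}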

flink count distinct issue

怎甘沉沦 submitted on 2021-02-11 14:26:39
Question: We currently use a tumbling window to count distinct values. The issue is that if we extend the tumbling window from a day to a month, we no longer get an up-to-date distinct count. That is, if we set the tumbling window to 1 month, the number we get only covers complete months starting on the 1st of each month. How can I get the current distinct count as of now (today is Mar 9)?

package flink.trigger;

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org
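A common way to get intermediate results out of a long window is to attach an early-firing trigger so the window emits its current aggregate periodically instead of only once the month ends. The sketch below is an assumption-heavy illustration (made-up Tuple3 input of page, user and timestamp, an in-memory-set distinct count, hourly early firing); it is not the code from the question's flink.trigger package:

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

public class EarlyFiringDistinctCountSketch {

    // Distinct count of the user field, kept in an in-memory set per window.
    public static class DistinctUsers
            implements AggregateFunction<Tuple3<String, String, Long>, Set<String>, Long> {
        @Override public Set<String> createAccumulator() { return new HashSet<>(); }
        @Override public Set<String> add(Tuple3<String, String, Long> value, Set<String> acc) {
            acc.add(value.f1);
            return acc;
        }
        @Override public Long getResult(Set<String> acc) { return (long) acc.size(); }
        @Override public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                        Tuple3.of("page-1", "user-a", 1_000L),
                        Tuple3.of("page-1", "user-b", 2_000L),
                        Tuple3.of("page-1", "user-a", 3_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple3<String, String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, previous) -> event.f2))
                .keyBy(event -> event.f0)
                .window(TumblingEventTimeWindows.of(Time.days(30)))   // "monthly", roughly
                // Emit the partial count every hour instead of waiting for the window
                // to close; each firing reflects the distinct count seen so far.
                .trigger(ContinuousProcessingTimeTrigger.of(Time.hours(1)))
                .aggregate(new DistinctUsers())
                .print();

        env.execute();
    }
}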

Is it possible to recover after losing the checkpoint coordinator

自作多情 submitted on 2021-02-11 13:23:29
Question: I'm using incremental checkpoints with RocksDB and saving the checkpoints to a remote destination (S3 in my case). What will happen if someone deletes the job manager server (where the checkpoint coordinator runs) and reinstalls it? By losing the checkpoint coordinator, do I also lose the option to recover state from the checkpoints? From what I know, the coordinator holds all the references to the checkpoints.

Answer 1: If you run Flink with high availability enabled, then Flink will
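High availability is driven by configuration rather than code. A sketch of the kind of flink-conf.yaml settings involved, assuming a ZooKeeper-based setup; the hosts, bucket names and cluster id below are placeholders:

# flink-conf.yaml (sketch; all values are placeholders)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: s3://my-bucket/flink/ha/
high-availability.cluster-id: /my-flink-cluster

# The checkpoints themselves already live in the remote destination:
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3://my-bucket/flink/checkpoints/

With high availability enabled, the job graph and checkpoint metadata are persisted to high-availability.storageDir and referenced from ZooKeeper, so a rebuilt job manager can locate the latest completed checkpoint again.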

How to enrich event stream with big file in Apache Flink?

落花浮王杯 submitted on 2021-02-10 18:34:43
Question: I have a Flink application for click-stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink, as shown in the image below. I want to enrich the incoming click-stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka. A simplified slice of the CSV file is shown below:

start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"

I have made some
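One common pattern for this kind of enrichment is to load the whole CSV into memory in a rich function's open() method and do the lookup per event, which works while the file is small enough to copy onto every task manager. The sketch below makes assumptions about the data model (events are reduced to a (userIp, country) tuple and ranges are scanned linearly); the question's actual event class and file location are unknown:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

// Input: (userIp, null); output: (userIp, country) once enriched.
public class IpEnrichmentFunction extends RichMapFunction<Tuple2<String, String>, Tuple2<String, String>> {

    private static class IpRange {
        final long start;
        final long end;
        final String country;
        IpRange(long start, long end, String country) {
            this.start = start;
            this.end = end;
            this.country = country;
        }
    }

    private final String csvPath;
    private transient List<IpRange> ranges;

    public IpEnrichmentFunction(String csvPath) {
        this.csvPath = csvPath;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the whole CSV once per task, before any element is processed.
        ranges = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            reader.readLine(); // skip header: start_ip,end_ip,country
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.replace("\"", "").split(",");
                ranges.add(new IpRange(ipToLong(parts[0]), ipToLong(parts[1]), parts[2]));
            }
        }
    }

    @Override
    public Tuple2<String, String> map(Tuple2<String, String> event) {
        long ip = ipToLong(event.f0);
        // Linear scan is fine for a sketch; a sorted list plus binary search scales better.
        for (IpRange range : ranges) {
            if (ip >= range.start && ip <= range.end) {
                return Tuple2.of(event.f0, range.country);
            }
        }
        return event;
    }

    private static long ipToLong(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Long.parseLong(octet.trim());
        }
        return value;
    }
}

If the file is too big to hold in memory, the broadcast state pattern or an async lookup against an external store is the usual alternative.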

Apache Flink Tumbling Window delayed result

风流意气都作罢 submitted on 2021-02-10 18:30:46
Question: I met an issue with an Apache Flink app that uses a tumbling window. The window size is 10 seconds, and I expect to get the resultSet DataStream every 10 seconds. However, the resultSet of the latest window is always delayed unless I push more data into the source stream. For example, if I push several records to the source stream between '01:33:40.0' and '01:34:00.0' and then stop and watch the log, nothing happens. When I push some data again at '01:37:XX', I then get the resultSet of the
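What is described is typical of event-time windows: a window's result is only emitted once a watermark past its end arrives, and watermarks normally only advance when new records show up, so the last window stays open while the source is quiet. Assuming the job uses the built-in bounded-out-of-orderness strategy, marking the stream idle at least keeps other, still-active partitions from being held back; the names below are illustrative, and if the whole source goes silent the final window still needs either fresh data or a processing-time-based watermark or trigger to close:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;

public class IdleSourceWatermarks {
    // Tolerate 2 seconds of out-of-orderness, and mark the stream idle after
    // 10 seconds without input so downstream operators are not held back by
    // a quiet partition.
    public static final WatermarkStrategy<Tuple2<String, Long>> STRATEGY =
            WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                    .withTimestampAssigner((event, previousTimestamp) -> event.f1)
                    .withIdleness(Duration.ofSeconds(10));
}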

how to manage many avsc files in flink when consuming multiple topics gracefully

南笙酒味 submitted on 2021-02-10 18:26:20
Question: Here is my case: I use Flink to consume many Kafka topics with SimpleStringSchema. OutputTag is used since we later need to bucket the data, as Parquet + Snappy, into directories by topic. We then go through all the topics, and each topic is processed with its own AVSC schema file. Now I have to modify the avsc schema files whenever new columns are added, which becomes a real problem when ten or a hundred files need to be modified. Is there a more graceful way to avoid changing the avsc files, or how to manage
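One way this is sometimes handled, instead of hard-coding one schema per topic in the job, is to resolve the writer schema at runtime from a location keyed by topic name: a directory of .avsc files, or better, a schema registry, which also removes the need to edit files by hand when columns are added. A small sketch of the directory variant; the path convention <schemaDir>/<topic>.avsc is an assumption:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;

// Loads one Avro schema per topic from <schemaDir>/<topic>.avsc at startup,
// so adding a topic or a column only touches the schema files, not the job code.
public class TopicSchemaLoader {

    public static Map<String, Schema> loadAll(String schemaDir, Iterable<String> topics) throws Exception {
        Map<String, Schema> schemasByTopic = new HashMap<>();
        for (String topic : topics) {
            // A fresh Schema.Parser per file, since one parser rejects duplicate type names.
            Schema schema = new Schema.Parser().parse(new File(schemaDir, topic + ".avsc"));
            schemasByTopic.put(topic, schema);
        }
        return schemasByTopic;
    }
}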

Adding custom dependencies for a Plugin in a Flink cluster

▼魔方 西西 submitted on 2021-02-10 12:51:50
Question: I have a Flink session cluster (Job Manager + Task Manager), version 1.11.1, with log4j-console.properties configured to include a Kafka appender. In addition, on both the Job Manager and the Task Manager I enable the flink-s3-fs-hadoop built-in plugin. I've added the kafka-clients jar to the flink/lib directory, which is necessary for the container to run. But I'm still getting the class-loading error below when the S3 plugin is instantiated (and initializes the logger).

Caused by: org
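Errors like this during plugin initialization are often a classloader visibility problem: each plugin in the plugins/ directory gets its own classloader, which does not see jars dropped into flink/lib. One knob that is sometimes used in this situation is to let the plugin classloaders resolve the Kafka client classes parent-first; whether that is the right fix for this particular appender setup would need verifying. A sketch of the flink-conf.yaml entry:

# flink-conf.yaml (sketch)
# Let plugin classloaders load Kafka client classes from the parent
# classpath (i.e. from flink/lib) instead of failing to find them.
plugin.classloader.parent-first-patterns.additional: org.apache.kafka.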

How to unit test a Flink ProcessFunction?

眉间皱痕 submitted on 2021-02-10 05:20:47
Question: I have a simple ProcessFunction that takes a String as input and produces a String as output. How do I unit test this with JUnit, given that the processElement method is void and returns no value?

public class SampleProcessFunction extends ProcessFunction<String, String> {

    @Override
    public void processElement(String content, Context context, Collector<String> collector) throws Exception {
        String output = content + "output";
        collector.collect(output);
    }
}

Answer 1: In order to unit test this method,
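A minimal sketch of one way to test this, driving processElement directly with a Collector backed by a plain list (Flink's flink-test-utils test harnesses are the heavier-weight alternative when timers or state are involved); JUnit 4 is assumed:

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.util.Collector;
import org.junit.Test;

public class SampleProcessFunctionTest {

    // A Collector that simply remembers everything it was given.
    private static class TestCollector<T> implements Collector<T> {
        final List<T> collected = new ArrayList<>();
        @Override public void collect(T record) { collected.add(record); }
        @Override public void close() { }
    }

    @Test
    public void processElementAppendsSuffix() throws Exception {
        SampleProcessFunction function = new SampleProcessFunction();
        TestCollector<String> out = new TestCollector<>();

        // The Context argument is not used by this particular function, so null is enough here.
        function.processElement("hello", null, out);

        assertEquals(Collections.singletonList("hellooutput"), out.collected);
    }
}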
