flink-streaming

Apache Flink: Count window with timeout

Submitted by 大憨熊 on 2019-12-02 02:34:22
Here is a simple code example to illustrate my question:

case class Record( key: String, value: Int )

object Job extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val data = env.fromElements(
    Record("01", 1),
    Record("02", 2),
    Record("03", 3),
    Record("04", 4),
    Record("05", 5)
  )

  val step1 = data.filter( record => record.value % 3 != 0 ) // introduces some data loss
  val step2 = data.map( r => Record( r.key, r.value * 2 ) )
  val step3 = data.map( r => Record( r.key, r.value * 3 ) )

  val merged = step1.union( step2, step3 )
  val keyed = merged.keyBy(0)
  val windowed = keyed
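
The excerpt is cut off right at the window definition. Not part of the question: one common way to build a count window that also fires after a timeout is a custom Trigger. The following Java sketch is a minimal illustration under assumptions, with made-up class, state, and parameter names; it fires after maxCount elements or after a processing-time timeout, whichever comes first.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;

// Hypothetical trigger: fires when maxCount elements have arrived in the window,
// or when a processing-time timeout expires, whichever happens first.
public class CountWithTimeoutTrigger<W extends Window> extends Trigger<Object, W> {

    private final long maxCount;
    private final long timeoutMs;

    private final ValueStateDescriptor<Long> countDesc =
            new ValueStateDescriptor<>("count", Long.class);
    private final ValueStateDescriptor<Long> deadlineDesc =
            new ValueStateDescriptor<>("deadline", Long.class);

    public CountWithTimeoutTrigger(long maxCount, long timeoutMs) {
        this.maxCount = maxCount;
        this.timeoutMs = timeoutMs;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ValueState<Long> count = ctx.getPartitionedState(countDesc);
        ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);

        // Register the timeout timer when the first element of this window arrives.
        if (deadline.value() == null) {
            long newDeadline = ctx.getCurrentProcessingTime() + timeoutMs;
            deadline.update(newDeadline);
            ctx.registerProcessingTimeTimer(newDeadline);
        }

        long newCount = (count.value() == null ? 0L : count.value()) + 1;
        if (newCount >= maxCount) {
            clear(window, ctx);
            return TriggerResult.FIRE_AND_PURGE;
        }
        count.update(newCount);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        // The timeout fired before the count was reached: emit whatever is buffered.
        clear(window, ctx);
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
        if (deadline.value() != null) {
            ctx.deleteProcessingTimeTimer(deadline.value());
        }
        deadline.clear();
        ctx.getPartitionedState(countDesc).clear();
    }
}

With such a trigger, the windowed stream from the snippet above could, for example, be built as keyed.window(GlobalWindows.create()).trigger(new CountWithTimeoutTrigger(100, 60000)) followed by a window function; the numbers here are placeholders.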

Apache Flink: How to count the total number of events in a DataStream

Submitted by 白昼怎懂夜的黑 on 2019-12-01 13:11:00
I have two raw streams, and I am joining those streams; then I want to count the total number of events that have been joined and how many have not. I am doing this by using a map on joinedEventDataStream, as shown below:

joinedEventDataStream.map(new RichMapFunction<JoinedEvent, Object>() {
    @Override
    public Object map(JoinedEvent joinedEvent) throws Exception {
        number_of_joined_events += 1;
        return null;
    }
});

Question #1: Is this the appropriate way to count the number of events in the stream?

Question #2: I have noticed some weird behavior, which some of you might not believe.
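
For question #1: a plain field like number_of_joined_events is kept per parallel subtask and is not reported anywhere, so its value is hard to interpret. A minimal sketch (not from the question) that uses Flink's metrics system instead, with a hypothetical operator class and metric name, and assuming the JoinedEvent type from the question:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Counts joined events with an operator-level counter instead of a plain field,
// so the value is visible per subtask in the web UI and metric reporters.
public class JoinedEventCounter extends RichMapFunction<JoinedEvent, JoinedEvent> {

    private transient Counter joinedEvents;

    @Override
    public void open(Configuration parameters) {
        joinedEvents = getRuntimeContext().getMetricGroup().counter("numJoinedEvents");
    }

    @Override
    public JoinedEvent map(JoinedEvent event) {
        joinedEvents.inc();
        return event; // forward the event instead of returning null
    }
}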

Ordering of Records in Stream

Submitted by 女生的网名这么多〃 on 2019-12-01 12:40:27
Here are some of the queries I have. I have two different streams, stream1 and stream2, in which the elements are in order.

1) Now when I do keyBy on each of these streams, will the order be maintained? (Since every group here will be sent to one task manager only.) My understanding is that the records will be in order within a group; correct me here.

2) After the keyBy on both of the streams I do a co-group to get the matching and non-matching records. Will the order be maintained here as well, since this also works on a KeyedStream? I am using EventTime, and AscendingTimestampExtractor for

Throughput and Latency on Apache Flink

Submitted by ▼魔方 西西 on 2019-12-01 11:18:53
I have written a very simple Java program for Apache Flink and now I am interested in measuring statistics such as throughput (the number of tuples processed per second) and latency (the time the program needs to process every input tuple).

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.readTextFile("/home/LizardKing/Documents/Power/Prova.csv")
   .map(new MyMapper())
   .writeAsCsv("/home/LizardKing/Results.csv");

JobExecutionResult res = env.execute();

I know that Flink exposes some metrics: https://ci.apache.org/projects/flink/flink-docs-release-1.2
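
Not from the question: a hedged sketch of two things commonly used for this, a per-subtask throughput meter registered through the metrics system plus Flink's built-in latency tracking. The class name, metric name, and record types are assumptions; MyMapper's real logic is not shown here.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.SimpleCounter;

// Wraps the user mapping with a per-subtask throughput meter (events per second).
public class MeteredMapper extends RichMapFunction<String, String> {

    private transient Meter throughput;

    @Override
    public void open(Configuration parameters) {
        // MeterView computes a rate over the given time span in seconds.
        throughput = getRuntimeContext().getMetricGroup()
                .meter("recordsPerSecond", new MeterView(new SimpleCounter(), 60));
    }

    @Override
    public String map(String value) {
        throughput.markEvent();
        return value; // the real transformation logic would go here
    }
}

// Latency tracking (emits latency markers that show up as operator latency metrics):
// env.getConfig().setLatencyTrackingInterval(1000);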

Apache Flink: Kafka connector in Python streaming API, “Cannot load user class”

Submitted by 孤街浪徒 on 2019-12-01 10:36:22
I am trying out Flink's new Python streaming API and attempting to run my script with ./flink-1.6.1/bin/pyflink-stream.sh examples/read_from_kafka.py. The Python script is fairly straightforward: I am just trying to consume from an existing topic and send everything to stdout (or the *.out file in the log directory, where the output method emits data by default).

import glob
import os
import sys
from java.util import Properties
from org.apache.flink.streaming.api.functions.source import SourceFunction
from org.apache.flink.streaming.api.collector.selector import OutputSelector
from org.apache
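
A "Cannot load user class" error from the Kafka connector usually points to a classpath problem: the matching flink-connector-kafka jar (and its dependencies) has to be in the cluster's lib/ directory or otherwise shipped with the job. For reference only, a minimal sketch of the same consumer written against the Java DataStream API; the topic name, broker address, and connector version (0.10) are assumptions:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

public class ReadFromKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume the existing topic and print each record to stdout / the *.out file.
        env.addSource(new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("read-from-kafka");
    }
}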

How to count unique words in a stream?

Submitted by 我与影子孤独终老i on 2019-12-01 03:07:50
Is there a way to count the number of unique words in a stream with Flink Streaming? The result would be a stream of numbers which keeps increasing.

You can solve the problem by storing all words which you have already seen. Having this knowledge, you can filter out all duplicate words. The rest can then be counted by a map operator with parallelism 1. The following code snippet does exactly that.

val env = StreamExecutionEnvironment.getExecutionEnvironment

val inputStream = env.fromElements("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")

// filter words out which we have already
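
Not part of the original snippet: a Java sketch of the same idea, with made-up class names, a keyed "seen" flag that forwards only the first occurrence of each word, followed by a parallelism-1 running count.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Forwards a word only the first time it is seen for its key (the word itself).
public class FirstOccurrenceFilter extends RichFlatMapFunction<String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void flatMap(String word, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(word); // first occurrence of this word
        }
    }
}

// Usage sketch: unique words pass the filter once, then a parallelism-1 map keeps
// a running count, so the output is an ever-increasing stream of numbers.
// DataStream<Long> uniqueWordCount = words
//     .keyBy(word -> word)
//     .flatMap(new FirstOccurrenceFilter())
//     .map(new RichMapFunction<String, Long>() {
//         private long count = 0L;
//         @Override
//         public Long map(String word) { return ++count; }
//     })
//     .setParallelism(1);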

Apache flink on Kubernetes - Resume job if jobmanager crashes

Submitted by 三世轮回 on 2019-11-30 09:55:27
I want to run a Flink job on Kubernetes, using a (persistent) state backend. It seems like crashing taskmanagers are no issue, as they can ask the jobmanager which checkpoint they need to recover from, if I understand correctly. A crashing jobmanager seems to be a bit more difficult. On this FLIP-6 page I read that ZooKeeper is needed to know which checkpoint the jobmanager needs to use to recover, and for leader election. Seeing as Kubernetes will restart the jobmanager whenever it crashes, is there a way for the new jobmanager to resume the job without having to set up a ZooKeeper cluster?
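
For the Flink versions available at the time of the question, jobmanager failover goes through the high-availability services, which in practice means ZooKeeper (a Kubernetes-native HA mode only arrived later, in Flink 1.12). A sketch of the ZooKeeper-based settings in flink-conf.yaml, with placeholder addresses and paths:

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# Job graphs and checkpoint pointers are stored here; ZooKeeper only keeps references to them.
high-availability.storageDir: s3://my-bucket/flink/ha
state.checkpoints.dir: s3://my-bucket/flink/checkpoints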