Question
I have two Flink DataStreams, for example dataStream1 and dataStream2. I want to union both streams into one stream so that I can process them with the same process functions, since the DAG for both streams is the same.
For now, I need messages from both streams to be consumed with equal priority. The producer of dataStream2 emits 10 messages per minute, while the producer of dataStream1 emits 1000 messages per second. The data types are the same for both streams. dataStream2 is more of a high-priority queue that should be consumed as soon as possible. There is no relation between the messages of dataStream1 and dataStream2.
Will dataStream1.union(dataStream2) produce a stream that contains the elements of both streams?
Answer 1:
Probably the simplest solution to this problem, though not necessarily the most efficient one depending on the exact specification of your data sources, is to connect the two streams. You could then use a CoProcessFunction, which invokes a separate method for each of the connected streams.
In this solution, you could simply buffer the elements of one stream until they can be emitted (for example, in a round-robin manner). But keep in mind that this may be quite inefficient if there is a very big difference between the rates at which the sources produce events.
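The buffering logic described above can be sketched in plain Java, independent of the Flink API. RoundRobinBuffer is a hypothetical helper, not a Flink class: in a real job its two queues would presumably live in Flink state, and onFirst/onSecond would be called from the two processing methods of the CoProcessFunction. This is only an illustration of the alternation, under those assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of Flink: alternates strictly between two
// inputs, buffering whichever side is ahead until the other side catches up.
class RoundRobinBuffer<T> {
    private final Deque<T> first = new ArrayDeque<>();
    private final Deque<T> second = new ArrayDeque<>();
    private boolean firstIsNext = true; // which input is due to emit next

    // Would be called for each element of the first connected stream.
    List<T> onFirst(T value) {
        first.addLast(value);
        return drain();
    }

    // Would be called for each element of the second connected stream.
    List<T> onSecond(T value) {
        second.addLast(value);
        return drain();
    }

    // Number of elements currently held back.
    int buffered() {
        return first.size() + second.size();
    }

    // Emits for as long as the input whose turn it is has a buffered element.
    private List<T> drain() {
        List<T> out = new ArrayList<>();
        while (true) {
            Deque<T> due = firstIsNext ? first : second;
            if (due.isEmpty()) {
                return out;
            }
            out.add(due.pollFirst());
            firstIsNext = !firstIsNext;
        }
    }
}
```

With the rates from the question (1000 messages per second against 10 per minute), the queue for the fast stream would grow without bound while it waits for the slow one, which is the inefficiency mentioned above.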
Answer 2:
It sounds like the two DataStreams have different types of elements, though you didn't specify that explicitly. If that's the case, then create an Either<stream1 type, stream2 type> via a MapFunction on each stream, then union() the two streams. You won't get exact intermingling of the two, as Flink will alternate consuming from each stream's network buffer.
If you really want nicely mixed streams, then (as others have noted) you'll need to buffer incoming elements via state, and also apply some heuristics to avoid over-buffering if for any reason (e.g. differing network latency, or more likely different performance between the two sources) you have very different data rates between the two streams.
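One possible over-buffering heuristic, sketched in plain Java rather than against the Flink API: alternate between the inputs, but give up the turn when the input that is due is idle and the other input's backlog exceeds a cap. BoundedInterleaver and maxBacklog are hypothetical names chosen for illustration, and in a real operator the queues would need to be kept in Flink state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of Flink: interleaves two inputs but caps
// how many elements either side may hold back while waiting for the other.
class BoundedInterleaver<T> {
    private final Deque<T> first = new ArrayDeque<>();
    private final Deque<T> second = new ArrayDeque<>();
    private final int maxBacklog;       // assumed tuning knob
    private boolean firstIsNext = true; // which input is due to emit next

    BoundedInterleaver(int maxBacklog) {
        this.maxBacklog = maxBacklog;
    }

    List<T> onFirst(T value) {
        first.addLast(value);
        return drain();
    }

    List<T> onSecond(T value) {
        second.addLast(value);
        return drain();
    }

    private List<T> drain() {
        List<T> out = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            Deque<T> due = firstIsNext ? first : second;
            Deque<T> other = firstIsNext ? second : first;
            if (!due.isEmpty()) {
                // Normal case: the input whose turn it is has data.
                out.add(due.pollFirst());
                firstIsNext = !firstIsNext;
                progressed = true;
            } else if (other.size() > maxBacklog) {
                // Heuristic: the due input is idle and the other is piling up,
                // so release its oldest element without changing the turn.
                out.add(other.pollFirst());
                progressed = true;
            }
        }
        return out;
    }
}
```

The cap trades strict alternation for bounded memory: a small maxBacklog behaves almost like a plain union, while a large one mixes the streams more evenly at the cost of more buffered elements.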
Answer 3:
You may want to use a custom operator that implements the InputSelectable interface in order to reduce the amount of buffering needed. I've included an example below that implements interleaving without any buffering, but be sure to read the caveat in the docs, which explains that
... the operator may receive some data that it does not currently want to process ...
In other words, this simple example can't be relied upon to really work as is.
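import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.InputSelectable;
import org.apache.flink.streaming.api.operators.InputSelection;
import org.apache.flink.streaming.api.operators.TwoInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class Alternate {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Two easily distinguishable inputs: 1..100 and -100..-1.
        DataStream<Long> positive = env.generateSequence(1L, 100L);
        DataStream<Long> negative = env.generateSequence(-100L, -1L);

        AlternatingTwoInputStreamOperator op = new AlternatingTwoInputStreamOperator();

        positive
            .connect(negative)
            .transform("Hack that needs buffering", Types.LONG, op)
            .print();

        env.execute();
    }
}

class AlternatingTwoInputStreamOperator extends AbstractStreamOperator<Long>
        implements TwoInputStreamOperator<Long, Long, Long>, InputSelectable {

    // Which input to read next; flipped after each element so the inputs alternate.
    private InputSelection nextSelection = InputSelection.FIRST;

    @Override
    public void processElement1(StreamRecord<Long> element) throws Exception {
        output.collect(element);
        nextSelection = InputSelection.SECOND;
    }

    @Override
    public void processElement2(StreamRecord<Long> element) throws Exception {
        output.collect(element);
        nextSelection = InputSelection.FIRST;
    }

    @Override
    public InputSelection nextSelection() {
        return this.nextSelection;
    }
}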
Note also that InputSelectable was added in Flink 1.9.0.
Source: https://stackoverflow.com/questions/59742165/consume-from-two-flink-datastream-based-on-priority-or-round-robin-way