Flink: Union, CoFlatMap, CoGroup, Join, and Connect
Guiding questions
1. Which Flink operations turn two data streams into a single stream?
2. What do cogroup, join, and coflatmap each do?
3. How do cogroup, join, and coflatmap differ?
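Flink offers four operations that turn two data streams into one: cogroup, join, coflatmap, and union. The rest of this post compares what they do and how to use them.
- Join: emits only the element pairs that satisfy the match condition.
- CoGroup: emits the matched element pairs and also the elements that found no match.
- CoFlatMap: has no match condition and does no matching; it simply processes the elements of each stream separately. You can build both join and cogroup behavior on top of it, so it is the most flexible of the three.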
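To make the difference between the first two bullets concrete, here is a minimal plain-Java sketch of what happens to the elements of a single window. It does not use the Flink API; the `Event` class and the method names are hypothetical stand-ins chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JoinVsCoGroupSketch {

    /** Hypothetical keyed element: a key plus a numeric value. */
    public static final class Event {
        final String key;
        final int value;

        public Event(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Groups one window's elements by key, keeping first-seen key order. */
    static Map<String, List<Integer>> byKey(List<Event> window) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Event e : window) {
            groups.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e.value);
        }
        return groups;
    }

    /** Join-like: one output per matching (left, right) pair; unmatched keys emit nothing. */
    public static List<String> join(List<Event> left, List<Event> right) {
        Map<String, List<Integer>> rightByKey = byKey(right);
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (int r : rightByKey.getOrDefault(l.key, Collections.<Integer>emptyList())) {
                out.add(l.key + ":" + l.value + "," + r);
            }
        }
        return out;
    }

    /** CoGroup-like: one output per key seen on either side; a missing side is an empty group. */
    public static List<String> coGroup(List<Event> left, List<Event> right) {
        Map<String, List<Integer>> l = byKey(left);
        Map<String, List<Integer>> r = byKey(right);
        Set<String> keys = new LinkedHashSet<>(l.keySet());
        keys.addAll(r.keySet());
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            out.add(k + ":" + l.getOrDefault(k, Collections.<Integer>emptyList())
                    + "," + r.getOrDefault(k, Collections.<Integer>emptyList()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> left = Arrays.asList(new Event("a", 1), new Event("b", 2));
        List<Event> right = Arrays.asList(new Event("a", 10), new Event("c", 30));
        System.out.println(join(left, right));    // only key "a" has elements on both sides
        System.out.println(coGroup(left, right)); // keys "a", "b", and "c" all produce output
    }
}
```

In this sketch the join result drops keys "b" and "c" entirely, while the cogroup result still reports them with an empty group on the missing side, which is the behavior the bullets describe.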
Join example code:
private static DataStream<PositionJoinModel> PositionTestJoin(
DataStream<ZongShu> grades,
DataStream<ZongShu> salaries,
long windowSize) {
DataStream<PositionJoinModel> apply =grades.join(salaries)
// Join condition: a field of stream1 equals a field of stream2
.where(new partitionsKeySelector1())
.equalTo(new partitionsKeySelector1())
// Specify the window. Elements of stream1 and stream2 are assigned to it, and only elements in the same window are joined
.window(TumblingProcessingTimeWindows.of(Time.milliseconds(windowSize)))
.apply(new JoinFunction<ZongShu, ZongShu, PositionJoinModel>() {
// Receives each matched pair (first, second); assemble the output record here
@Override
public PositionJoinModel join(
ZongShu first,
ZongShu second) {
return new PositionJoinModel(first.getRoom(), first.getPartitions(),first.getNum(), second.getNum());
}
});
return apply;
}
CoGroup example code:
private static DataStream<YCSB_LB_RESULT_Model> YCLB_Result_CGroup(
DataStream<YCSB_LB_Model> grades,
DataStream<YCSB_LB_Model> salaries,
long windowSize) {
DataStream<YCSB_LB_RESULT_Model> apply = grades.coGroup(salaries)
.where(new YCFB_Result_KeySelector())
.equalTo(new YCFB_Result_KeySelector())
.window(TumblingProcessingTimeWindows.of(Time.milliseconds(windowSize)))
.apply(new CoGroupFunction<YCSB_LB_Model, YCSB_LB_Model, YCSB_LB_RESULT_Model>() {
@Override
public void coGroup(Iterable<YCSB_LB_Model> first, Iterable<YCSB_LB_Model> second, Collector<YCSB_LB_RESULT_Model> collector) throws Exception {
// A fresh result per call; a local variable avoids keeping mutable state on the function object
YCSB_LB_RESULT_Model ylrm = new YCSB_LB_RESULT_Model();
for (YCSB_LB_Model s : first) {
String asset_id = s.getAsset_id();
ylrm.setAsset_id(asset_id);
ylrm.setName(s.getName());
ylrm.setIp(s.getIp());
ylrm.setRoom(s.getRoom());
ylrm.setPartitions(s.getPartitions());
ylrm.setBox(s.getBox());
ylrm.setLevel_1(s.getNum());
}
for (YCSB_LB_Model s1 : second) {
ylrm.setLevel_2(s1.getNum());
}
collector.collect(ylrm);
}
});
return apply;
}
CoFlatMap example code:
DataStream<Tuple2<String, Integer>> grades = WindowJoinSampleData.GradeSource.getSource(env, rate);
DataStream<Tuple2<String, Integer>> salaries = WindowJoinSampleData.SalarySource.getSource(env, rate);
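// Key both streams by the first tuple field; a KeySelector lambda replaces the deprecated index-based keyBy(0)
KeyedStream<Tuple2<String, Integer>, String> tuple2TupleKeyedStream = grades.keyBy(t -> t.f0);
KeyedStream<Tuple2<String, Integer>, String> tuple2TupleKeyedStream1 = salaries.keyBy(t -> t.f0);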
SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> tuple3SingleOutputStreamOperator = tuple2TupleKeyedStream
.connect(tuple2TupleKeyedStream1)
.flatMap(new EnrichmentFunction());
public static class EnrichmentFunction extends RichCoFlatMapFunction<Tuple2<String,Integer>, Tuple2<String,Integer>, Tuple3<String, Integer,Integer>> {
// keyed, managed state
private ValueState<Tuple2<String,Integer>> rideState;
private ValueState<Tuple2<String,Integer>> fareState;
@Override
public void open(Configuration config) {
rideState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved ride", TypeInformation.of(new TypeHint<Tuple2<String,Integer>>() {
})));
fareState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved fare", TypeInformation.of(new TypeHint<Tuple2<String,Integer>>() {
})));
}
@Override
public void flatMap1(Tuple2<String,Integer> ride, Collector<Tuple3<String,Integer,Integer>> out) throws Exception {
Tuple2<String,Integer> fare = fareState.value();
if (fare != null) {
fareState.clear();
out.collect(new Tuple3<>(ride.f0, ride.f1, fare.f1));
} else {
rideState.update(ride);
}
}
@Override
public void flatMap2(Tuple2<String,Integer> fare, Collector<Tuple3<String,Integer,Integer>> out) throws Exception {
Tuple2<String,Integer> ride = rideState.value();
if (ride != null) {
rideState.clear();
out.collect(new Tuple3<>(ride.f0, ride.f1, fare.f1));
} else {
fareState.update(fare);
}
}
}
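The buffering logic of EnrichmentFunction can be easier to follow without the Flink plumbing. The sketch below is a plain-Java imitation, assuming two `HashMap`s as stand-ins for the keyed `ValueState` (one slot per key); the class and method names are hypothetical and this is not how Flink stores state internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoFlatMapPairingSketch {
    // Stand-ins for keyed ValueState: at most one buffered element per key and side.
    private final Map<String, Integer> rideState = new HashMap<>();
    private final Map<String, Integer> fareState = new HashMap<>();
    private final List<String> out = new ArrayList<>();

    /** Mirrors flatMap1: emit if a fare is already waiting for this key, otherwise buffer the ride. */
    public void onRide(String key, int ride) {
        Integer fare = fareState.remove(key); // read and clear in one step
        if (fare != null) {
            out.add(key + "," + ride + "," + fare);
        } else {
            rideState.put(key, ride); // overwrites an earlier unmatched ride, like ValueState.update
        }
    }

    /** Mirrors flatMap2: emit if a ride is already waiting for this key, otherwise buffer the fare. */
    public void onFare(String key, int fare) {
        Integer ride = rideState.remove(key);
        if (ride != null) {
            out.add(key + "," + ride + "," + fare);
        } else {
            fareState.put(key, fare);
        }
    }

    public List<String> output() {
        return out;
    }

    public static void main(String[] args) {
        CoFlatMapPairingSketch s = new CoFlatMapPairingSketch();
        s.onRide("a", 1);  // buffered, no fare for "a" yet
        s.onFare("b", 20); // buffered, no ride for "b" yet
        s.onFare("a", 10); // pairs with the buffered ride for "a"
        s.onRide("b", 2);  // pairs with the buffered fare for "b"
        s.onRide("c", 3);  // stays buffered; nothing is emitted for "c"
        System.out.println(s.output());
    }
}
```

Note that an element whose partner never arrives stays buffered indefinitely, both in this sketch and in the example above. In a real job you would probably want to bound that, for instance with state TTL or timers, which is beyond what the original example shows.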
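Summary
union can merge multiple data streams, but with one restriction: all of the streams must have the same data type. connect offers something similar for combining two data streams. It differs from union as follows:
- connect can only combine two data streams, while union can combine more than two.
- The two streams combined by connect may have different data types, while the streams combined by union must all have the same type.
- After connect, two DataStreams become a ConnectedStreams. A ConnectedStreams applies a separate processing method to each stream's elements, and the two streams can share state.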
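The three differences above can be illustrated with a small plain-Java sketch. It models streams as lists and does not use the Flink API; `CoMapper`, `SeenCounter`, and the `union`/`connect` helpers are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionVsConnectSketch {

    /** Hypothetical stand-in for a two-input map function: one method per input. */
    public interface CoMapper<A, B, R> {
        R map1(A left);
        R map2(B right);
    }

    /** Union-like: any number of inputs, all with the same element type T. */
    @SafeVarargs
    public static <T> List<T> union(List<T> first, List<T>... others) {
        // Simplification: inputs are concatenated; a real stream union interleaves them.
        List<T> out = new ArrayList<>(first);
        for (List<T> other : others) {
            out.addAll(other);
        }
        return out;
    }

    /** Connect-like: exactly two inputs whose types may differ, handled by separate methods. */
    public static <A, B, R> List<R> connect(List<A> left, List<B> right, CoMapper<A, B, R> fn) {
        List<R> out = new ArrayList<>();
        // Alternate between the inputs to imitate interleaved arrival (real ordering is not deterministic).
        for (int i = 0; i < Math.max(left.size(), right.size()); i++) {
            if (i < left.size()) out.add(fn.map1(left.get(i)));
            if (i < right.size()) out.add(fn.map2(right.get(i)));
        }
        return out;
    }

    /** Both methods update the same field: this is the state shared between the two inputs. */
    public static final class SeenCounter implements CoMapper<Integer, String, String> {
        private int seen = 0;

        @Override
        public String map1(Integer n) {
            return "int " + n + " #" + (++seen);
        }

        @Override
        public String map2(String s) {
            return "str " + s + " #" + (++seen);
        }
    }

    public static void main(String[] args) {
        System.out.println(union(Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5)));
        System.out.println(connect(Arrays.asList(1, 2), Arrays.asList("x"), new SeenCounter()));
    }
}
```

The compiler enforces the type rule in both helpers: `union` rejects lists of different element types, while `connect` accepts an `Integer` list and a `String` list but takes exactly two inputs.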
Source: oschina
Link: https://my.oschina.net/112612/blog/3215689