Flink usage notes
- Deploy the Flink cluster; in my case the deployment mode is Flink on YARN.
- Create a new module containing the Flink processing logic.
- Package the module as an executable jar and add it to the overall project.
- Submit the job from the Flink client.
- Check the job details in the Flink web UI.
Flink Window Functions
After defining the window assigner, we still need to specify the computation to perform on each window. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready to be processed.
See https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html#triggers to learn how Flink determines when a window is ready.
The window function can be a ReduceFunction, a FoldFunction or a WindowFunction. The first two are more efficient because they aggregate each element incrementally as it arrives in the window. A WindowFunction gets an Iterable over all elements of a window, plus additional meta information about the window the elements belong to.
A windowed transformation with a WindowFunction is less efficient than the others because Flink internally buffers all elements of the window before invoking the function. This can be mitigated by combining a WindowFunction with a ReduceFunction or FoldFunction, which yields both incremental aggregation of the window elements and the additional window metadata the WindowFunction receives. Below we look at an example of each variant.
ReduceFunction
A ReduceFunction specifies how two input values are combined into one output value of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.
A ReduceFunction can be defined and used like this:
Java code:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.reduce(new ReduceFunction<Tuple2<String, Long>>() {
public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
}
});
Scala code:
val input: DataStream[(String, Long)] = ...
input
.keyBy(<key selector>)
.window(<window assigner>)
.reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }
The example above sums the second field of the tuples of all elements in the window.
Example
Sometimes we need to filter data because some intermediate records are not needed. A typical scenario: when binlog data is updated we only want the latest record, so we group by id, keep the record with the largest version, and store that one.
public class ReduceApp {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Order> userInfoDataStream = env.addSource(new OrderSource());
DataStream<Order> timedData = userInfoDataStream.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Order>() {
@Override
public long extractAscendingTimestamp(Order element) {
return element.getMdTime().getTime();
}
});
SingleOutputStreamOperator<Order> reduce = timedData
.keyBy("id")
.timeWindow(Time.seconds(10), Time.seconds(5))
.reduce((ReduceFunction<Order>) (v1, v2) -> v1.getVersion() >= v2.getVersion() ? v1 : v2);
reduce.print();
env.execute("test");
}
public static class Order {
// primary key id
private Integer id;
// version
private Integer version;
private Timestamp mdTime;
public Order(int id, Integer version) {
this.id = id;
this.version = version;
this.mdTime = new Timestamp(System.currentTimeMillis());
}
public Order() {
}
// Getters/setters so that Flink treats Order as a POJO; keyBy("id"), getVersion() and getMdTime() rely on them
public Integer getId() { return id; }
public void setId(Integer id) { this.id = id; }
public Integer getVersion() { return version; }
public void setVersion(Integer version) { this.version = version; }
public Timestamp getMdTime() { return mdTime; }
public void setMdTime(Timestamp mdTime) { this.mdTime = mdTime; }
}
}
//自定义Source
public class OrderSource implements SourceFunction<Order> {
Random random = new Random();
@Override
public void run(SourceContext<Order> ctx) throws Exception {
while (true) {
TimeUnit.MILLISECONDS.sleep(100);
// To keep records distinguishable, generate simple ids in the range 0~2 and versions in the range 0~99
int id = random.nextInt(3);
Order o = new Order(id, random.nextInt(100));
ctx.collect(o);
}
}
@Override
public void cancel() {
}
}
FoldFunction
A FoldFunction specifies how an input element is combined with an element of the output type. The FoldFunction is called incrementally for every element added to the window together with the current output value; the first element is combined with a predefined initial value of the output type.
A FoldFunction can be defined and used like this:
Java code:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.fold("", new FoldFunction<Tuple2<String, Long>, String>> {
public String fold(String acc, Tuple2<String, Long> value) {
return acc + value.f1;
}
});
Scala code:
val input: DataStream[(String, Long)] = ...
input
.keyBy(<key selector>)
.window(<window assigner>)
.fold("") { (acc, v) => acc + v._2 }
The example above appends all the Long values of the input elements to an initially empty string.
Note that fold() cannot be used with session windows or other mergeable windows.
Example:
// This example illustrates what a FoldFunction window does; once that is clear, the snippets above show how to apply it.
public class TansExamples {
public static void main(String[] args) throws Exception{
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataStream = env.fromElements("can you help me see can you help me you you");
DataStream<String> result = dataStream.flatMap(new FlatMapFunction<String, WordWithCount>() {
@Override
public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
for(String word: value.split(" ")){
out.collect(new WordWithCount(word, 1));
}
}}).keyBy("word")
.fold("", new FoldFunction<WordWithCount, String>() {
@Override
public String fold(String current, WordWithCount value) throws Exception {
if(current.equals("start")){
return current + "_" + value.word + "_" + value.count;
}
else{
return current + "_" + value.count;
}
}
});
result.print().setParallelism(1);
env.execute("test for map");
}
public static class WordWithCount{
public String word;
public Integer count;
public WordWithCount(){}
public WordWithCount(String word, Integer count){
this.word = word;
this.count = count;
}
@Override
public String toString(){
return word + ":" + count;
}
}
}
WindowFunction — The Generic Case
A WindowFunction gets an Iterable containing all the elements of the window and provides the greatest flexibility of all window functions. This comes at the cost of performance and resource consumption, because the elements of the window cannot be aggregated incrementally; instead they are buffered until the window is considered ready for processing.
The signature of a WindowFunction looks as follows:
Java code:
public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {
/**
// Evaluates the window and outputs none or several elements.
// @param key The key for which this window is evaluated.
// @param window The window that is being evaluated.
// @param input The elements in the window being evaluated.
// @param out A collector for emitting elements.
// @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out) throws Exception;
}
Scala code:
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
/**
// Evaluates the window and outputs none or several elements.
//
// @param key The key for which this window is evaluated.
// @param window The window that is being evaluated.
// @param input The elements in the window being evaluated.
// @param out A collector for emitting elements.
// @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
A WindowFunction can be defined and used like this:
Java code:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.apply(new MyWindowFunction());
/* ... */
public class MyWindowFunction implements WindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) {
long count = 0;
for (Tuple2<String, Long> in: input) {
count++;
}
out.collect("Window: " + window + " count: " + count);
}
}
Scala code:
val input: DataStream[(String, Long)] = ...
input
.keyBy(<key selector>)
.window(<window assigner>)
.apply(new MyWindowFunction())
/* ... */
class MyWindowFunction extends WindowFunction[(String, Long), String, String, TimeWindow] {
def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
var count = 0L
for (in <- input) {
count = count + 1
}
out.collect(s"Window $window count: $count")
}
}
The example above shows a WindowFunction that counts the elements of a window and, in addition, adds information about the window itself to the output.
Note: using a WindowFunction for simple aggregations such as counting is quite inefficient. The next section shows how a ReduceFunction can be combined with a WindowFunction to get both incremental aggregation and the additional window metadata a WindowFunction receives.
ProcessWindowFunction
Wherever a WindowFunction can be used, you can also use a ProcessWindowFunction. It is similar to a WindowFunction, except that its interface allows querying more information about the context in which the window evaluation takes place.
The ProcessWindowFunction interface looks like this:
Java code:
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window> implements Function {
/**
// Evaluates the window and outputs none or several elements.
//
// @param key The key for which this window is evaluated.
// @param context The context in which the window is being evaluated.
// @param elements The elements in the window being evaluated.
// @param out A collector for emitting elements.
//
// @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
public abstract void process(
KEY key,
Context context,
Iterable<IN> elements,
Collector<OUT> out) throws Exception;
/**
// The context holding window metadata
*/
public abstract class Context {
/**
// @return The window that is being evaluated.
*/
public abstract W window();
}
}
Scala code:
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {
/**
// Evaluates the window and outputs none or several elements.
//
// @param key The key for which this window is evaluated.
// @param context The context in which the window is being evaluated.
// @param elements The elements in the window being evaluated.
// @param out A collector for emitting elements.
// @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
@throws[Exception]
def process(
key: KEY,
context: Context,
elements: Iterable[IN],
out: Collector[OUT])
/**
// The context holding window metadata
*/
abstract class Context {
/**
// @return The window that is being evaluated.
*/
def window: W
}
}
A ProcessWindowFunction can be used like this:
Java code:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.process(new MyProcessWindowFunction());
Scala code:
val input: DataStream[(String, Long)] = ...
input
.keyBy(<key selector>)
.window(<window assigner>)
.process(new MyProcessWindowFunction())
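The calls above reference a MyProcessWindowFunction that is never defined in the original text. Below is a minimal counting sketch modelled on the interface shown earlier; the class body is illustrative, not taken from the source.
Java code:
public class MyProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> out) {
        long count = 0;
        for (Tuple2<String, Long> in : elements) {
            count++;
        }
        // Unlike WindowFunction.apply(), the window itself is obtained from the context
        out.collect("Window: " + context.window() + " count: " + count);
    }
}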
WindowFunction with Incremental Aggregation
A WindowFunction can be combined with a ReduceFunction or a FoldFunction to incrementally aggregate elements as they arrive in the window. When the window is closed, the WindowFunction is handed the aggregated result. This allows the window to be computed incrementally while still having access to the additional window metadata of the WindowFunction.
Note: you can also use a ProcessWindowFunction instead of a WindowFunction for incremental window aggregation.
Incremental Window Aggregation with FoldFunction
The following example shows how an incremental FoldFunction can be combined with a WindowFunction to extract the number of events in the window and return the key and the end time of the window as well.
Java code:
DataStream<SensorReading> input = ...;
input
.keyBy(<key selector>)
.timeWindow(<window assigner>)
.fold(new Tuple3<String, Long, Integer>("",0L, 0), new MyFoldFunction(), new MyWindowFunction())
// Function definitions
private static class MyFoldFunction
implements FoldFunction<SensorReading, Tuple3<String, Long, Integer> > {
public Tuple3<String, Long, Integer> fold(Tuple3<String, Long, Integer> acc, SensorReading s) {
Integer cur = acc.getField(2);
acc.setField(2, cur + 1);
return acc;
}
}
private static class MyWindowFunction
implements WindowFunction<Tuple3<String, Long, Integer>, Tuple3<String, Long, Integer>, String, TimeWindow> {
public void apply(String key,
TimeWindow window,
Iterable<Tuple3<String, Long, Integer>> counts,
Collector<Tuple3<String, Long, Integer>> out) {
Integer count = counts.iterator().next().getField(2);
out.collect(new Tuple3<String, Long, Integer>(key, window.getEnd(),count));
}
}
Scala code:
val input: DataStream[SensorReading] = ...
input
.keyBy(<key selector>)
.timeWindow(<window assigner>)
.fold (
("", 0L, 0),
(acc: (String, Long, Int), r: SensorReading) => { ("", 0L, acc._3 + 1) },
( key: String,
window: TimeWindow,
counts: Iterable[(String, Long, Int)],
out: Collector[(String, Long, Int)] ) =>
{
val count = counts.iterator.next()
out.collect((key, window.getEnd, count._3))
}
)
Incremental Window Aggregation with ReduceFunction
The following example shows how an incremental ReduceFunction can be combined with a WindowFunction to return the smallest event in a window along with the start time of the window.
Java code:
DataStream<SensorReading> input = ...;
input
.keyBy(<key selector>)
.timeWindow(<window assigner>)
.reduce(new MyReduceFunction(), new MyWindowFunction());
// Function definitions
private static class MyReduceFunction implements ReduceFunction<SensorReading> {
public SensorReading reduce(SensorReading r1, SensorReading r2) {
return r1.value() > r2.value() ? r2 : r1;
}
}
private static class MyWindowFunction
implements WindowFunction<SensorReading, Tuple2<Long, SensorReading>, String, TimeWindow> {
public void apply(String key,
TimeWindow window,
Iterable<SensorReading> minReadings,
Collector<Tuple2<Long, SensorReading>> out) {
SensorReading min = minReadings.iterator().next();
out.collect(new Tuple2<Long, SensorReading>(window.getStart(), min));
}
}
Scala code:
val input: DataStream[SensorReading] = ...
input
.keyBy(<key selector>)
.timeWindow(<window assigner>)
.reduce(
(r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
( key: String,
window: TimeWindow,
minReadings: Iterable[SensorReading],
out: Collector[(Long, SensorReading)] ) =>
{
val min = minReadings.iterator.next()
out.collect((window.getStart, min))
}
)
Link: https://www.jianshu.com/p/a883262241ef
Defining keys with a key selector function (keySelector())
Another way to define keys is a "key selector" function. A key selector function takes a single element as input and returns the element's key. The key can be of any type and must be derived from a deterministic computation.
The following example shows a key selector function that simply returns one field of an object (i.e., groups by that field):
// some ordinary POJO
public class WC {public String word; public int count;}
DataStream<WC> words = // [...]
KeyedStream<WC, String> keyed = words
.keyBy(new KeySelector<WC, String>() {
public String getKey(WC wc) { return wc.word; }
});
Computing over data inside Flink windows
1. Apache Flink: Keyed Window vs. Non-Keyed Window (see the sketch below)
2. Aggregating over an entire window at once in Flink — ProcessWindowFunction
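A minimal sketch of the difference between the two window styles referenced above, assuming the older timeWindow/timeWindowAll API used elsewhere in these notes; the element values and window size are arbitrary illustrations:
Java code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Long>> events = env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));
// Keyed window: the stream is partitioned by the first tuple field, so windows are evaluated per key and can run in parallel
events.keyBy(0)
        .timeWindow(Time.seconds(10))
        .sum(1)
        .print();
// Non-keyed window: a single window over the whole stream, evaluated with parallelism 1
events.timeWindowAll(Time.seconds(10))
        .sum(1)
        .print();
env.execute("keyed vs non-keyed windows");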
Managing Flink state and configuring it in code
Reposted from: "Flink state management — configuring checkpoints in code"
Checkpoint overview
- To make state fault tolerant, Flink needs to checkpoint the state.
- Checkpointing is the core of Flink's fault-tolerance mechanism. Based on the configuration, it periodically generates snapshots of the state of each operator/task in the stream and persists this state durably, so that if the Flink program crashes unexpectedly it can be restarted and selectively restored from one of these snapshots, correcting the data anomalies caused by the failure.
- Flink's checkpoint mechanism can only interact with persistent storage for streams and state if the following are available:
  - A persistent source that can replay events for a certain amount of time. Typical examples are persistent message queues (e.g. Apache Kafka, RabbitMQ) or file systems (e.g. HDFS, S3, GFS).
  - A persistent store for state, typically a distributed file system (e.g. HDFS, S3, GFS).
Checkpoint configuration
- Checkpointing is disabled by default and has to be enabled explicitly.
- Once checkpointing is enabled, the default checkpoint mode is exactly-once.
- There are two checkpoint modes: exactly-once and at-least-once.
- Exactly-once is appropriate for most applications; at-least-once may be used by applications that need ultra-low latency (consistently a few milliseconds).
- To enable checkpointing (disabled by default), for example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Start a checkpoint every 1000 ms (checkpoint interval)
env.enableCheckpointing(1000);
// Advanced options:
// Set the mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Make sure at least 500 ms pass between checkpoints (minimum pause between checkpoints)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// Checkpoints have to complete within one minute or are discarded (checkpoint timeout)
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Retain the checkpoint data when the job is cancelled, so it can be restored from a chosen checkpoint later (see the note below)
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: the checkpoint data is retained when the job is cancelled, so it can be restored from a chosen checkpoint as needed. ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: the checkpoint data is deleted when the job is cancelled; checkpoints are only kept when the job fails.
State Backend (where state is stored)
- By default, state is kept in the TaskManager's memory and checkpoints are stored in the JobManager's memory.
- Where state is stored and where checkpoints go depends on the configured State Backend (env.setStateBackend(…)).
- There are three State Backends:
  - MemoryStateBackend
  - FsStateBackend
  - RocksDBStateBackend
MemoryStateBackend
- State data is kept on the Java heap. When a checkpoint is taken, the snapshot of the state is stored in the JobManager's memory.
- The memory-based state backend is not recommended for production use.
FsStateBackend
- State data is kept in the TaskManager's memory. When a checkpoint is taken, the snapshot of the state is written to the configured file system.
- A distributed file system such as HDFS can be used.
RocksDBStateBackend
- RocksDB is slightly different from the backends above: it maintains state on the local file system, writing state directly into a local RocksDB instance. It also needs a remote filesystem URI (usually HDFS); during a checkpoint the local data is copied to that filesystem, and on failover the state is restored from the filesystem back to the local node.
- RocksDB removes the limitation that state must fit in memory while still persisting it to a remote file system, which makes it a good fit for production.
How to configure the State Backend
There are two ways to change the State Backend:
- Option 1: adjust a single job
  - Change the code of the job itself, e.g.
  - env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"));
  - or new MemoryStateBackend()
  - or new RocksDBStateBackend(filebackend, true); (requires an additional third-party dependency)
- Option 2: adjust the whole cluster
  - Edit flink-conf.yaml:
    state.backend: filesystem
    state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
- Note: state.backend can take one of the following values:
  - jobmanager (MemoryStateBackend)
  - filesystem (FsStateBackend)
  - rocksdb (RocksDBStateBackend)
State Backend demo
Option 1: adjusting a single job
Start the program that connects to socket zzy:9001:
./bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -c com.zzy.bigdata.flink.SocketWindowWordCountJavaCheckPoint zzy_flink_learn.jar --port 9001
[iknow@data-5-63 flink-1.7.2]$ ./bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -c com.zzy.bigdata.flink.SocketWindowWordCountJavaCheckPoint zzy_flink_learn.jar --port 9001
2019-03-06 12:03:15,057 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-iknow.
2019-03-06 12:03:15,057 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-iknow.
2019-03-06 12:03:15,325 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-03-06 12:03:15,415 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-03-06 12:03:15,415 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-03-06 12:03:15,421 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - The argument yn is deprecated in will be ignored.
2019-03-06 12:03:15,421 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - The argument yn is deprecated in will be ignored.
2019-03-06 12:03:15,511 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1024, numberTaskManagers=1, slotsPerTaskManager=1}
2019-03-06 12:03:15,819 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/home/iknow/zhangzhiyong/flink-1.7.2/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2019-03-06 12:03:16,386 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1551789318445_0004
2019-03-06 12:03:16,412 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1551789318445_0004
2019-03-06 12:03:16,412 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
2019-03-06 12:03:16,414 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
2019-03-06 12:03:19,940 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
Starting execution of program
If port 9001 is not open on zzy, the JobManager web UI shows the error below.
Checkpointing is configured in the code:
// Get the Flink execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpointing is disabled by default and has to be enabled explicitly; start a checkpoint every 10000 ms (checkpoint interval)
env.enableCheckpointing(10000);
// Advanced options:
// Set the mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Make sure at least 500 ms pass between checkpoints (minimum pause between checkpoints)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// Checkpoints have to complete within one minute or are discarded (checkpoint timeout)
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Retain the checkpoint data when the job is cancelled, so it can be restored from a chosen checkpoint later (see the note below)
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: the checkpoint data is retained when the job is cancelled, so it can be restored from a chosen checkpoint as needed
// ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: the checkpoint data is deleted when the job is cancelled; checkpoints are only kept when the job fails
// Set the state backend
//env.setStateBackend(new MemoryStateBackend());
//env.setStateBackend(new FsStateBackend("hdfs://zzy:9000/flink/checkpoints"));
// RocksDB requires the dependency flink-statebackend-rocksdb_2.11
//env.setStateBackend(new RocksDBStateBackend("hdfs://zzy:9000/flink/checkpoints", true));
env.setStateBackend(new FsStateBackend("hdfs://192.168.5.63:9000/flink/checkpoints"));
However, the JobManager web UI shows that no checkpoint is triggered.
The error below indicates that zzy:9001 cannot be reached because the hostname zzy cannot be resolved.
Instead, listen on port 9001 of 50.63. If the nc command is missing, install it with
yum install -y nc
and then start the Flink program with the following command, using Flink on YARN:
./bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -c com.zzy.bigdata.flink.SocketWindowWordCountJavaCheckPoint zzy_flink_learn.jar --port 9001
2019-03-06 16:00:24,680 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
If the "Deployment took more than 60 seconds" message keeps repeating, the cluster has probably run out of resources.
Here we kill application_1551789318445_0007 and application_1551789318445_0008 (these are test machines with very tight resources)
and then restart the program.
Check that the YARN application reaches the "deployed successfully" state.
YARN started application application_1551789318445_0009.
Click the AM link to open the JobManager web UI.
On the checkpoint UI
you can see that a checkpoint is taken every 10 s.
Looking at the checkpoint data on HDFS, the 10 most recent checkpoints are retained.
The directory 95d75e802ba1eceefeaf98636e907883 corresponds to the job ID,
which shows that the settings in conf/flink-conf.yaml took effect.
Flink can retain multiple checkpoints; add the following setting to specify the maximum number of checkpoints to keep:
state.checkpoints.num-retained: 10
Flink learning notes
- Counters: flink-learning/flink-learning-examples/accumulator: counting
- Reading alert rules / broadcast variables / scheduled threads
- AsyncDataStream.unorderedWait(machineData, new AlertRuleAsyncIOFunction() — see the async I/O sketch below
- flinkcep
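The AsyncDataStream.unorderedWait call listed above belongs to Flink's async I/O API. The sketch below shows how such a call is typically wired up; the AlertRuleAsyncIOFunction body and the machineData stream are stand-ins for the real lookup logic, not the original implementation.
Java code:
// Illustrative async enrichment: look up an alert rule for each machine-data record
public class AlertRuleAsyncIOFunction extends RichAsyncFunction<String, String> {
    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
        // A real job would query an external store (HBase, Redis, an HTTP service, ...) asynchronously here
        CompletableFuture
                .supplyAsync(() -> input + " -> matchedRule")
                .thenAccept(result -> resultFuture.complete(Collections.singletonList(result)));
    }
}

// Wiring it into the pipeline: unordered results, 10 s timeout, at most 100 in-flight requests
DataStream<String> enriched = AsyncDataStream.unorderedWait(machineData, new AlertRuleAsyncIOFunction(), 10, TimeUnit.SECONDS, 100);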
Joining multiple Flink streams while sharing ValueState
package com.zhisheng.examples.streaming.join;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
public class WindowJoin {
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final long windowSize = params.getLong("windowSize", 2000);
final long rate = params.getLong("rate", 3L);
System.out.println("Using windowSize=" + windowSize + ", data rate=" + rate);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.getConfig().setGlobalJobParameters(params);
DataStream<Tuple2<String, Integer>> grades = WindowJoinSampleData.GradeSource.getSource(env, rate);
DataStream<Tuple2<String, Integer>> salaries = WindowJoinSampleData.SalarySource.getSource(env, rate);
KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = grades.keyBy(0);
KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream1 = salaries.keyBy(0);
SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> tuple3SingleOutputStreamOperator = tuple2TupleKeyedStream
.connect(tuple2TupleKeyedStream1)
.flatMap(new EnrichmentFunction());
tuple3SingleOutputStreamOperator.print();
runWindowJoin(grades, salaries, windowSize).print().setParallelism(1);
env.execute("Windowed Join Example");
}
public static class EnrichmentFunction extends RichCoFlatMapFunction<Tuple2<String,Integer>, Tuple2<String,Integer>, Tuple3<String, Integer,Integer>> {
// keyed, managed state
private ValueState<Tuple2<String,Integer>> rideState;
private ValueState<Tuple2<String,Integer>> fareState;
@Override
public void open(Configuration config) {
rideState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved ride", TypeInformation.of(new TypeHint<Tuple2<String,Integer>>() {
})));
fareState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved fare", TypeInformation.of(new TypeHint<Tuple2<String,Integer>>() {
})));
}
@Override
public void flatMap1(Tuple2<String,Integer> ride, Collector<Tuple3<String,Integer,Integer>> out) throws Exception {
Tuple2<String,Integer> fare = fareState.value();
if (fare != null) {
fareState.clear();
out.collect(new Tuple3(ride.f0,ride.f1, fare.f1));
} else {
rideState.update(ride);
}
}
@Override
public void flatMap2(Tuple2<String,Integer> fare, Collector<Tuple3<String,Integer,Integer>> out) throws Exception {
Tuple2<String,Integer> ride = rideState.value();
if (ride != null) {
rideState.clear();
out.collect(new Tuple3(ride.f0,ride.f1, fare.f1));
} else {
fareState.update(fare);
}
}
}
public static DataStream<Tuple3<String, Integer, Integer>> runWindowJoin(
DataStream<Tuple2<String, Integer>> grades, DataStream<Tuple2<String, Integer>> salaries, long windowSize) {
return grades.join(salaries)
.where(new NameKeySelector())
.equalTo(new NameKeySelector())
.window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
.apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>>() {
@Override
public Tuple3<String, Integer, Integer> join(Tuple2<String, Integer> first, Tuple2<String, Integer> second) {
return new Tuple3<String, Integer, Integer>(first.f0, first.f1, second.f1);
}
});
}
private static class NameKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
@Override
public String getKey(Tuple2<String, Integer> value) {
return value.f0;
}
}
}
Using Flink CEP
Individual patterns

| Type | API | Meaning |
| --- | --- | --- |
| Quantifier API | times() | How many times the pattern occurs. Example: pattern.times(2,4) — the pattern occurs 2, 3 or 4 times |
| Quantifier API | timesOrMore() / oneOrMore() | The pattern occurs at least N times. Example: pattern.timesOrMore(2) — the pattern occurs 2 or more times |
| Quantifier API | optional() | The pattern may not match at all. Example: pattern.times(2).optional() — the pattern occurs 2 times or 0 times |
| Quantifier API | greedy() | Match as many occurrences as possible. Example: pattern.times(2).greedy() — the pattern occurs 2 times and repeats as often as possible |
| Condition API | where() | Condition of the pattern. Example: pattern.where(_.ruleId == 43322) — the condition is ruleId == 43322 |
| Condition API | or() | OR condition of the pattern. Example: pattern.where(_.ruleId == 43322).or(_.ruleId == 43333) — the condition is ruleId == 43322 or 43333 |
| Condition API | until() | The pattern keeps occurring until condition X is satisfied. Example: pattern.oneOrMore().until(condition) — the pattern occurs one or more times until condition is satisfied |
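A short sketch of how these quantifier and condition APIs look on a Java Pattern; the AlertEvent type, its getRuleId() accessor and the rule ids are placeholders for illustration.
Java code:
// Quantifier: the "start" pattern must occur 2 to 4 times, matching greedily;
// condition: accept events whose ruleId is 43322 or 43333
Pattern<AlertEvent, ?> pattern = Pattern.<AlertEvent>begin("start")
        .times(2, 4).greedy()
        .where(new SimpleCondition<AlertEvent>() {
            @Override
            public boolean filter(AlertEvent event) {
                return event.getRuleId() == 43322;
            }
        })
        .or(new SimpleCondition<AlertEvent>() {
            @Override
            public boolean filter(AlertEvent event) {
                return event.getRuleId() == 43333;
            }
        });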
Flink match skip strategies (AfterMatchSkipStrategy)
Guiding questions
1. What is a match skip strategy for? 2. What are the four skip strategies? 3. How are skip strategies used in code?
1. Introduction to match skip strategies
For a given pattern, the same event can be assigned to multiple successful matches. To control how many matches an event is assigned to, you specify a skip strategy called AfterMatchSkipStrategy.
There are four types of skip strategies:
- NO_SKIP: every possible match is emitted.
- SKIP_PAST_LAST_EVENT: discards every partial match that started after the match started but before it ended.
- SKIP_TO_FIRST: discards every partial match that started after the match started but before the first event matched by PatternName occurred.
- SKIP_TO_LAST: discards every partial match that started after the match started but before the last event matched by PatternName occurred.
Note that when using the SKIP_TO_FIRST and SKIP_TO_LAST skip strategies, a valid PatternName has to be specified as well.
[Note: these skip strategies really target events that occur repeatedly in loops, e.g. SKIP_TO_FIRST and SKIP_TO_NEXT.]
For example, for the pattern b+ c and the data stream b1 b2 b3 c, the differences between the skip strategies are as follows:
1. Skip strategy: NO_SKIP
Result: b1 b2 b3 c, b2 b3 c, b3 c
Description: after the match b1 b2 b3 c is found, the match process does not discard any result.
2. Skip strategy: SKIP_TO_NEXT
Result: b1 b2 b3 c, b2 b3 c, b3 c
Description: after the match b1 b2 b3 c is found, the match process does not discard any result, because no other match can start at b1.
3. Skip strategy: SKIP_PAST_LAST_EVENT
Result: b1 b2 b3 c
Description: after the match b1 b2 b3 c is found, the match process discards all partial matches that have started.
4. Skip strategy: SKIP_TO_FIRST
Result: b1 b2 b3 c, b2 b3 c, b3 c
Description: after the match b1 b2 b3 c is found, the match process tries to discard all partial matches that started before b1; there are no such matches, so nothing is discarded.
5. Skip strategy: SKIP_TO_LAST
Result: b1 b2 b3 c, b3 c
Description: after the match b1 b2 b3 c is found, the match process tries to discard all partial matches that started before b3. There is one such match: b2 b3 c.
Examples
To better see the difference between NO_SKIP and SKIP_TO_FIRST, consider the pattern (a | b | c)(b | c) c+.greedy d and the sequence a b c1 c2 c3 d. The results are:
1. Skip strategy: NO_SKIP
Result: a b c1 c2 c3 d, b c1 c2 c3 d, c1 c2 c3 d
Description: after the match a b c1 c2 c3 d is found, the match process does not discard any result.
2. Skip strategy: SKIP_TO_FIRST
Result: a b c1 c2 c3 d, c1 c2 c3 d
Description: after the match a b c1 c2 c3 d is found, the match process discards all partial matches that started before c1. There is one such match: b c1 c2 c3 d.
To better understand the difference between NO_SKIP and SKIP_TO_NEXT, consider the pattern a b+ and the sequence a b1 b2 b3. The results are:
1. Skip strategy: NO_SKIP
Result: a b1, a b1 b2, a b1 b2 b3
Description: after the match a b1 is found, the match process does not discard any result.
2. Skip strategy: SKIP_TO_NEXT
Result: a b1
Description: after the match a b1 is found, the match process discards all partial matches starting at a. This means that neither a b1 b2 nor a b1 b2 b3 is generated. [Here SKIP_TO_NEXT simply suppresses the repeated matches.]
To specify which skip strategy to use, create an AfterMatchSkipStrategy by calling one of the following functions:
2. Programming with match skip strategies
1. Function: AfterMatchSkipStrategy.noSkip() — creates a NO_SKIP skip strategy.
2. Function: AfterMatchSkipStrategy.skipToNext() — creates a SKIP_TO_NEXT skip strategy.
3. Function: AfterMatchSkipStrategy.skipPastLastEvent() — creates a SKIP_PAST_LAST_EVENT skip strategy.
4. Function: AfterMatchSkipStrategy.skipToFirst(patternName) — creates a SKIP_TO_FIRST skip strategy referencing the pattern named patternName.
5. Function: AfterMatchSkipStrategy.skipToLast(patternName) — creates a SKIP_TO_LAST skip strategy referencing the pattern named patternName.
The skip strategy is then applied to a pattern by calling:
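The original text stops short of showing the actual call, so here is a minimal sketch of attaching a skip strategy to a pattern in Java; the pattern name "patternName" and the Event type with getName() follow the condition examples below and are otherwise placeholders.
Java code:
// SKIP_TO_FIRST / SKIP_TO_LAST need a valid pattern name to refer to
AfterMatchSkipStrategy skipStrategy = AfterMatchSkipStrategy.skipToFirst("patternName");
// The strategy is passed to Pattern.begin(), so it applies to the whole pattern sequence
Pattern<Event, ?> pattern = Pattern.<Event>begin("patternName", skipStrategy)
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) {
                return event.getName().startsWith("foo");
            }
        });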
IterativeConditions — iterative conditions:
This is the most common type of condition. You can specify a condition that accepts subsequent events based on properties of previously accepted events or on statistics over a subset of them.
The code below says: the iterative condition accepts the next event for the pattern named "middle" if the event's name starts with "foo" and if the sum of the prices of the previously accepted events of that pattern plus the price of the current event does not exceed 5.0. Iterative conditions can be very powerful, especially in combination with looping patterns such as oneOrMore().
middle.oneOrMore().where(new IterativeCondition<SubEvent>() {
@Override
public boolean filter(SubEvent value, Context<SubEvent> ctx) throws Exception {
if (!value.getName().startsWith("foo")) {
return false;
}
double sum = value.getPrice();
for (Event event : ctx.getEventsForPattern("middle")) {
sum += event.getPrice();
}
return Double.compare(sum, 5.0) < 0;
}
});
Note the call to context.getEventsForPattern(...), which retrieves all previously accepted events for the given potential match. The cost of this operation can vary widely, so try to minimize its use when writing conditions.
SimpleConditions — simple conditions
This type of condition extends the IterativeCondition class mentioned above and decides whether to accept an event based only on properties of the event itself.
start.where(new SimpleCondition<Event>() {
@Override
public boolean filter(Event value) {
return value.getName().startsWith("foo");
}
});
Finally, you can also restrict the type of accepted events to a subtype of the initial event type via the pattern.subtype(subClass) method:
start.subtype(SubEvent.class).where(new SimpleCondition<SubEvent>() {
@Override
public boolean filter(SubEvent value) {
return ... // some condition
}
});
Flink SQL window operations
Source: oschina
Link: https://my.oschina.net/112612/blog/3215767