A First Look at Apache Beam: Java Edition

——————————————
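Copyright notice: This is an original article by the blogger "henyu", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://i.cnblogs.com/EditPosts.aspx?postid=11430012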

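1. Overview
Riding the big data wave, technology iterates quickly. Thanks to open source, big data developers have an abundance of tools to choose from, but that abundance also makes it harder to pick the right one. A big data problem is rarely solved with a single technology, and the choice depends entirely on business requirements: MapReduce for batch processing, Flink for real-time stream processing, Spark SQL for interactive SQL, and so on. Integrating all of these open-source frameworks, tools, libraries, and platforms takes a great deal of work and adds a great deal of complexity, which is a recurring headache for big data developers. This post introduces one solution for unifying these resources: Apache Beam.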

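Beam is a unified programming framework that supports both batch and stream processing. A program built with the Beam programming model can run on multiple execution engines (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, and others).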

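This post does not dwell on the pros and cons of Apache Beam or on its prospects. It is meant as a starting point for readers who are new to Beam and do not know how to begin writing code.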

2. What Is Apache Beam

There are plenty of introductions to Apache Beam online, so I will not repeat them here. If you are interested, see the links below.

https://blog.csdn.net/qq_34777600/article/details/87165765 (original by: 一只IT小小鸟)

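https://www.cnblogs.com/bigben0123/p/9590489.html (by Zhang Haitao (张海涛), who works on Hikvision's cloud infrastructure platform, is responsible for cloud and big data infrastructure architecture and middleware development, and focuses on cloud computing and big data. He is one of the founders of the Apache Beam Chinese community. To follow the latest Apache Beam news and research, add WeChat ID cyrjkj to join the group.)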

3. Getting Started with Code

Example 1: Reading and writing files with TextIO

<dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

<dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-core-java</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

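import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

    /**
     * Read and write files with TextIO.
     */
    public static void TextIo() {
        // Create the pipeline options
        PipelineOptions pipelineOptions = PipelineOptionsFactory.create();
        // Select the runner: the local DirectRunner here (Examples 2 and 3 use Flink and Spark)
        pipelineOptions.setRunner(DirectRunner.class);
        // Create the pipeline
        Pipeline pipeline = Pipeline.create(pipelineOptions);
        // Read the file at the given path; each line becomes one element
        pipeline.apply(TextIO.read().from("C:\\bigdata\\apache_beam\\src\\main\\resources\\abc"))
                .apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Split the line on spaces and emit each non-empty word
                        for (String word : c.element().split(" ")) {
                            if (!word.isEmpty()) {
                                c.output(word);
                                System.out.println("Word read from file: " + word);
                            }
                        }
                    }
                })).apply(Count.<String>perElement())
                .apply("formatResult", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
                    @Override
                    public String apply(KV<String, Long> input) {
                        return input.getKey() + " : " + input.getValue();
                    }
                }))
                // Write the results. to() takes a filename prefix rather than a directory,
                // so the sharded output file names start with "resources".
                .apply(TextIO.write().to("C:\\bigdata\\apache_beam\\src\\main\\resources"));

        pipeline.run().waitUntilFinish();

    }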
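
To make the data flow of Example 1 easier to follow, here is a minimal plain-Java sketch of the same three steps: split lines into words, count each distinct word, and format the result. It does not use Beam at all; the class and method names (WordCountSketch, extractWords, countPerElement, format) are made up for illustration, and the sorted output order is a choice of this sketch, not something the Beam pipeline guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the word-count steps; it does not use Beam.
final class WordCountSketch {

    // Mirrors the ExtractWords step: split on single spaces, drop empty tokens.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Mirrors the counting step: one (word, occurrences) entry per distinct word.
    static Map<String, Long> countPerElement(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted only to make the output deterministic
        for (String line : lines) {
            for (String word : extractWords(line)) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Mirrors the formatResult step: "word : count".
    static List<String> format(Map<String, Long> counts) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            out.add(e.getKey() + " : " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // The double space in the second line yields an empty token, which is dropped.
        List<String> lines = Arrays.asList("a b a", "b  c");
        for (String row : format(countPerElement(lines))) {
            System.out.println(row);
        }
    }
}
```

In the Beam version each of these steps is a transform applied to a PCollection, so the runner is free to distribute the work across workers.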

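Example 2: Using Flink as the execution engine with Kafka, computing over streaming Kafka data in windows

Add the required dependencies. Note that FlinkPipelineOptions and FlinkRunner live in Beam's Flink runner artifact, which is not in the list below; add the variant that matches your Flink version.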

<dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
 <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-kafka</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-core-java -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-core-java</artifactId>
            <version>${beam.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
<dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-metrics-core</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

Core code:

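import com.google.common.collect.ImmutableMap; // Guava, assumed to be on the classpath
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.Duration;

    /**
     * Read from and write to Kafka, using FlinkRunner.
     * kafkaBootstrapServers, inputTopic and outputTopic are String fields defined elsewhere in the class.
     */
    public static void flinkKafka() {
        FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
        // Explicitly set the runner to FlinkRunner; if it is not set, the pipeline runs on the local default runner
        options.setRunner(FlinkRunner.class);
        options.setStreaming(true);
        options.setAppName("app_test");
        options.setJobName("flinkjob");
        // Beam documents the special value "[local]" for an embedded Flink; the original post uses "local"
        options.setFlinkMaster("local");
        options.setParallelism(10);
        // Create the pipeline that will run on Flink
        Pipeline pipeline = Pipeline.create(options);
        // Specify the KafkaIO key and value types; both are String here, but other types can be used
        PCollection<KafkaRecord<String, String>> lines =
                pipeline.apply(KafkaIO.<String, String>read()
                                // Address of the Kafka cluster
                                .withBootstrapServers(kafkaBootstrapServers)
                                // A single topic is used here; for several topics use withTopics(List<String>).
                                // The settings largely mirror the native Kafka client.
                                .withTopic(inputTopic)
                                // Key and value deserializers
                                .withKeyDeserializer(StringDeserializer.class)
                                .withValueDeserializer(StringDeserializer.class)
                                // Update Kafka consumer properties; others, such as the consumer group, can be set the same way
                                .withConsumerConfigUpdates(ImmutableMap.<String, Object>of("auto.offset.reset", "latest"))
                        // Optional settings, left commented out:
                        // Use Kafka's log append time as the element timestamp (the default or a custom policy also works)
                        // .withLogAppendTime()
                        // Equivalent to Kafka's "isolation.level" = "read_committed": the consumer reads only
                        // non-transactional messages or committed transactional messages from its input topics.
                        // Stream applications usually process data in several read-process-write stages, each one
                        // consuming the previous stage's output. With read_committed, exactly-once processing can be
                        // achieved across all stages. Exactly-once semantics require Kafka 0.11 or later.
                        // .withReadCommitted()
                        // Commit offsets to Kafka when Beam finalizes a checkpoint, instead of relying on Kafka's auto commit
                        // .commitOffsetsInFinalize()
                        // Drop Kafka metadata such as offset and partition and return plain KV elements;
                        // this changes the element type, and a Values transform can then keep only the values
                        // .withoutMetadata()
                );

        // Extract the value of each Kafka record
        PCollection<String> kafkadata = lines.apply("Remove Kafka Metadata", ParDo.of(new DoFn<KafkaRecord<String, String>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println("Record read from Kafka: " + c.element().getKV());
                c.output(c.element().getKV().getValue());
            }
        }));

        // Process the data: assign 5-second fixed windows, then count identical elements per window
        PCollection<String> wordCount = kafkadata
                .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
                .apply(Count.<String>perElement())
                .apply("ConcatResultKV", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
                    // Format the final output (the key is the word, the value is the count)
                    @Override
                    public String apply(KV<String, Long> input) {
                        System.out.println("Count: " + input.getKey() + ": " + input.getValue());
                        return input.getKey() + ": " + input.getValue();
                    }
                }));
        // Send the processed data back to Kafka
        wordCount.apply(KafkaIO.<Void, String>write()
                        .withBootstrapServers(kafkaBootstrapServers)
                        .withTopic(outputTopic)
                        // No key serializer is needed: the key type is Void and only values are written
                        .withValueSerializer(StringSerializer.class)
                        .values()
        );
        pipeline.run().waitUntilFinish();

    }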
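
The 5-second FixedWindows used above splits the timeline into consecutive, non-overlapping intervals and places each element in the interval that contains its timestamp. The arithmetic can be sketched in plain Java as follows. This is an illustration only, not Beam's implementation; the class and method names are hypothetical, and the windows are assumed to be aligned to the epoch (offset zero).

```java
// Plain-Java illustration of fixed-window assignment; it does not use Beam.
final class FixedWindowSketch {

    // Start (inclusive) of the fixed window that contains the timestamp.
    // floorMod keeps the result correct for negative timestamps as well.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // End (exclusive) of the fixed window that contains the timestamp.
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 5_000L; // 5 seconds, as in Duration.standardSeconds(5)
        for (long ts : new long[] {0L, 4_999L, 5_000L, 12_345L}) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

Elements at 0 ms and 4999 ms share the window [0, 5000), while an element at exactly 5000 ms starts the next one. The counting step then runs once per window, so the same word arriving in two different windows produces two separate counts.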

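Example 3: Spark as the runner, reading streaming data from Kafka, applying time windows, and writing the results back to Kafka

The dependencies are much the same as in Example 2, plus the Spark runner: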

<dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-spark</artifactId>
            <version>${beam.version}</version>
        </dependency>

Core code

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
// The remaining imports are the Beam, Kafka, Guava, and Joda-Time classes already used in Example 2.

    /**
     * Use Spark as the runner and consume Kafka data.
     * kafkaBootstrapServers, inputTopic and outputTopic are String fields defined elsewhere in the class.
     */
    public static void sparkKafka() {
        // Create the pipeline options
        SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
        // Set the options; the runner is set explicitly so the pipeline does not fall back to the default runner
        options.setRunner(SparkRunner.class);
        options.setSparkMaster("local[*]");
        options.setAppName("spark-beam");
        options.setCheckpointDir("/user/chickpoint16");
        // Create the pipeline
        Pipeline pipeline = Pipeline.create(options);
        // Read data from Kafka
        PCollection<KafkaRecord<String, String>> lines = pipeline.apply(KafkaIO.<String, String>read()
                // Kafka address
                .withBootstrapServers(kafkaBootstrapServers)
                // Topic to read from
                .withTopic(inputTopic)
                // Key and value deserializers
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                // Update Kafka consumer properties; others, such as the consumer group, can be set the same way.
                // The value must be exactly "latest", with no surrounding whitespace.
                .withConsumerConfigUpdates(ImmutableMap.<String, Object>of("auto.offset.reset", "latest"))

        );
        // Process the data: split into words, window into 5-second intervals, count per window
        PCollection<String> wordcount = lines.apply("split data",ParDo.of(new DoFn<KafkaRecord<String, String>,String>() {
            @ProcessElement
            public void processElement(ProcessContext c){
                String[] arr=c.element().getKV().getValue().split(" ");
                for(String value :arr){
                    if(!value.isEmpty()){
                        c.output(value);
                    }
                }

            }
        })).apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
                .apply(Count.<String>perElement())
                .apply("wordcount",MapElements.via(new SimpleFunction<KV<String,Long>,String>(){
                    @Override
                    public String apply(KV<String,Long> input){
                        System.out.println(input.getKey()+" : "+input.getValue());
                        System.err.println("===============================================");
                        return input.getKey()+" : "+input.getValue();
                    }
                }));
        // Prints a description of the PCollection, not its contents
        System.out.println(wordcount);
        // Send the processed data back to Kafka
        wordcount.apply(KafkaIO.<Void, String>write()
                .withBootstrapServers(kafkaBootstrapServers)
                .withTopic(outputTopic)
                .withValueSerializer(StringSerializer.class)
                .values()
        );
        pipeline.run().waitUntilFinish();

    }
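
Example 3 splits each Kafka message into words before windowing, so the result is one word count per 5-second window. The plain-Java sketch below shows what that produces for a handful of records. It does not use Beam or Kafka; the class name, the method names, and the timestamps and messages are all invented for illustration, and windows are assumed to be aligned to the epoch.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a per-window word count; it does not use Beam or Kafka.
final class WindowedWordCountSketch {

    // Adds every non-empty word of the line to the counts of the window containing the timestamp.
    static void add(Map<Long, Map<String, Long>> counts, long timestampMillis, String line, long sizeMillis) {
        long windowStart = timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
        Map<String, Long> window = counts.computeIfAbsent(windowStart, k -> new TreeMap<>());
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                window.merge(word, 1L, Long::sum);
            }
        }
    }

    // Counts words per window for parallel arrays of timestamps and message values.
    static Map<Long, Map<String, Long>> count(long[] timestamps, String[] lines, long sizeMillis) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>(); // keyed by window start
        for (int i = 0; i < timestamps.length; i++) {
            add(counts, timestamps[i], lines[i], sizeMillis);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 4_000L, 6_000L};   // hypothetical record timestamps in ms
        String[] lines = {"a b", "a", "a"};             // hypothetical message values
        // The first two records share the window starting at 0; the third lands in the window starting at 5000.
        System.out.println(count(timestamps, lines, 5_000L));
    }
}
```

The word "a" appears three times overall, but it is reported as 2 in the first window and 1 in the second, which is the shape of the messages the pipeline writes back to the output topic.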

Example 4: HBaseIO

Dependencies

<dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-hbase</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>

Code:

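import java.util.List;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

    /**
     * Read HBase data with Apache Beam's HBaseIO.
     * hbase_clientPort, hbase_zookeeper_quorum, zookeeper_znode_parent and hbase_table
     * are String fields defined elsewhere in the class.
     */
    public static void  getHbaseData(){
        // Create the pipeline options. To run on Spark instead, use the commented-out lines:
//        SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
//        options.setJobName("read mongo");
//        options.setSparkMaster("local[*]");
//        options.setCheckpointDir("/user/chickpoint17");
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setRunner(DirectRunner.class);
        // HBase connection settings
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.property.clientPort", hbase_clientPort);
        config.set("hbase.zookeeper.quorum", hbase_zookeeper_quorum);
        config.set("zookeeper.znode.parent", zookeeper_znode_parent);
        config.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        config.setInt("hbase.rpc.timeout", 20000);
        config.setInt("hbase.client.operation.timeout", 30000);
        config.setInt("hbase.client.scanner.timeout.period", 2000000);
        // Create the pipeline
        Pipeline pipeline = Pipeline.create(options);
        // Read the rows whose keys fall between "001" (inclusive) and "004" (exclusive)
        PCollection<Result> result = pipeline.apply(HBaseIO.read()
                .withConfiguration(config)
                .withTableId(hbase_table)
                .withKeyRange("001".getBytes(),"004".getBytes())
        );
        PCollection<String> process = result.apply("process", ParDo.of(new DoFn<Result, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String row = Bytes.toString(c.element().getRow());
                List<Cell> cells = c.element().listCells();
                for (Cell cell:cells){
                    // Bytes.toString takes (array, offset, length), in that order
                    String family = Bytes.toString(cell.getFamilyArray(),cell.getFamilyOffset(),cell.getFamilyLength());
                    String column = Bytes.toString(cell.getQualifierArray(),cell.getQualifierOffset(),cell.getQualifierLength());
                    String value= Bytes.toString(cell.getValueArray(),cell.getValueOffset(),cell.getValueLength());
                    System.out.println(family);
                    c.output(row+"------------------ "+family+" : "+column+" = "+value);
                    System.out.println(row+"------------------ "+family+" : "+column+" = "+value);
                }
            }
        }));

        pipeline.run().waitUntilFinish();
    }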
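
Two details of the HBase read are easy to get wrong: a cell's family, qualifier, and value are slices of a shared backing array and must be decoded with (array, offset, length) in that order, and a key range treats the start row as inclusive and the stop row as exclusive, comparing row keys byte by byte. The plain-Java sketch below illustrates both without any HBase dependency. The class and method names are hypothetical, and the backing array layout is deliberately simplified; it is not the real HBase cell encoding.

```java
import java.nio.charset.StandardCharsets;

// Plain-Java illustration of slice decoding and key-range checks; it does not use HBase.
final class HBaseReadSketch {

    // Decodes `length` bytes starting at `offset`, the same argument order as Bytes.toString(array, offset, length).
    static String decode(byte[] backing, int offset, int length) {
        return new String(backing, offset, length, StandardCharsets.UTF_8);
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned values.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // True if row lies in [start, stop): start inclusive, stop exclusive.
    static boolean inKeyRange(byte[] row, byte[] start, byte[] stop) {
        return compare(row, start) >= 0 && compare(row, stop) < 0;
    }

    public static void main(String[] args) {
        // Simplified layout: row "001", family "cf", qualifier "name", value "beam" packed back to back.
        byte[] backing = "001cfnamebeam".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(backing, 3, 2)); // family, prints "cf"
        System.out.println(decode(backing, 2, 3)); // offset and length swapped, prints "1cf"

        byte[] start = "001".getBytes(StandardCharsets.UTF_8);
        byte[] stop = "004".getBytes(StandardCharsets.UTF_8);
        System.out.println(inKeyRange("003".getBytes(StandardCharsets.UTF_8), start, stop)); // prints "true"
        System.out.println(inKeyRange("004".getBytes(StandardCharsets.UTF_8), start, stop)); // prints "false"
    }
}
```

Swapping offset and length does not necessarily throw; as the second call shows, it can silently return the wrong slice, which makes the mistake hard to spot in the output.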