Consume GCS files based on pattern from Flink

Posted by 空扰寡人 on 2019-12-12 18:24:29

Question


Flink supports the Hadoop FileSystem abstraction, and there is a GCS connector, a library that implements that abstraction on top of Google Cloud Storage.

How do I create a Flink file source using the code in this repo?


Answer 1:


To achieve this, you need to:

  1. Install and configure the GCS connector on your Flink cluster (a sample core-site.xml sketch follows these steps).
  2. Add the Hadoop and Flink dependencies (including the HDFS connector) to your project:
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-filesystem_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-hadoop-compatibility_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
    
  3. Use them to create a data source with a GCS path (a complete runnable sketch follows after these steps):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.hadoopcompatibility.HadoopInputs;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextInputFormat;
    
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
    // Wrap Hadoop's TextInputFormat so Flink can read from GCS;
    // each record is a (byte offset, line of text) tuple.
    DataSet<Tuple2<LongWritable, Text>> input =
        env.createInput(
            HadoopInputs.readHadoopFile(
                new TextInputFormat(), LongWritable.class, Text.class,
                "gs://bucket/path/some*pattern/"));

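For step 1, the GCS connector is typically wired into the Hadoop configuration (core-site.xml) that your Flink cluster picks up. A minimal sketch is below, assuming service-account authentication; the property names come from the GCS connector's documentation, while the project ID and key-file path are placeholders to replace with your own:

    <configuration>
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
      <property>
        <name>fs.gs.project.id</name>
        <value>my-gcp-project</value>
      </property>
      <property>
        <name>google.cloud.auth.service.account.enable</name>
        <value>true</value>
      </property>
      <property>
        <name>google.cloud.auth.service.account.json.keyfile</name>
        <value>/path/to/service-account-key.json</value>
      </property>
    </configuration>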

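Note that Hadoop's FileInputFormat expands glob patterns in the input path, which is what makes the some*pattern part of the URI above match multiple files. As a usage illustration, here is a minimal self-contained job built on the same snippet; the class name, bucket, and line-counting logic are assumptions for the example, not part of the original answer:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.hadoopcompatibility.HadoopInputs;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical example class; the bucket and glob are placeholders.
    public class GcsLineCount {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Read every file matching the glob; each record is
            // a (byte offset, line of text) tuple.
            DataSet<Tuple2<LongWritable, Text>> lines =
                env.createInput(
                    HadoopInputs.readHadoopFile(
                        new TextInputFormat(), LongWritable.class, Text.class,
                        "gs://my-bucket/logs/2019-*/part-*"));

            // count() triggers execution, so no explicit env.execute() is needed.
            System.out.println("Lines read from GCS: " + lines.count());
        }
    }
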
Source: https://stackoverflow.com/questions/55612471/consume-gcs-files-based-on-pattern-from-flink
