Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

时光说笑 2021-01-06 12:39

I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.

How can I do that?

2 answers
  • 2021-01-06 13:16

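    You can use Flink's HadoopOutputFormat wrapper around a Hadoop MultipleTextOutputFormat, like this:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
      // Replace the key with NullWritable so that only the value is written to the file.
      override def generateActualKey(key: K, value: V): K =
        NullWritable.get().asInstanceOf[K]

      // Use the record's key as the name of the file the record is written to.
      override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
        key.asInstanceOf[String]
    }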
    

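    You can then use IteblogMultipleTextOutputFormat as follows:

    import org.apache.flink.api.scala._
    import org.apache.flink.api.scala.hadoop.mapred.HadoopOutputFormat
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

    val env = ExecutionEnvironment.getExecutionEnvironment
    val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
    val jc = new JobConf()
    FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
    val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)
    val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
      ("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
    batch.output(format)
    env.execute() // sinks are lazy; nothing is written until the job is executed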
    

    For more information, see: http://www.iteblog.com/archives/1667
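
    To make the routing rule of the output format above concrete, here is a minimal plain-Java sketch. It does not use Flink or Hadoop at all: the class KeyRouting and its route helper are hypothetical names made up for illustration, and the sketch ignores parallelism and how files are actually named and laid out on the file system. It only shows the idea that the key picks the file and only the value is written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (no Flink or Hadoop involved) of the routing rule
// implemented by the custom output format: key -> file name, value -> line.
class KeyRouting {

    // Hypothetical helper: returns file name -> lines that would land in that file.
    static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] record : records) {
            String key = record[0];   // mirrors generateFileNameForKeyValue: the key names the file
            String value = record[1]; // mirrors generateActualKey: the key is dropped, the value is kept
            files.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return files;
    }

    public static void main(String[] args) {
        // Same sample records as in the Scala snippet.
        List<String[]> records = List.of(
            new String[] {"A", "1"}, new String[] {"A", "2"}, new String[] {"A", "3"},
            new String[] {"B", "1"}, new String[] {"B", "2"},
            new String[] {"C", "1"}, new String[] {"D", "2"});
        route(records).forEach((file, lines) -> System.out.println(file + " -> " + lines));
    }
}
```

    With the sample records, the sketch groups the values 1, 2, 3 under A, the values 1, 2 under B, 1 under C, and 2 under D, which is the per-key split the question asks for.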

  • 2021-01-06 13:37

    You can add as many data sinks to a DataSet program as you need.

    For example in a program like this:

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple3<String, Long, Long>> data =
        env.readCsvFile(...).types(String.class, Long.class, Long.class);
    // apply MapFunction and emit
    data.map(new YourMapper()).writeAsText("/foo/bar");
    // apply FilterFunction and emit
    data.filter(new YourFilter()).writeAsCsv("/foo/bar2");
    // both sinks are written when the job is executed
    env.execute();
    

    The program reads a DataSet named data from a CSV file and passes it to two independent transformations:

    1. A MapFunction, whose result is written to a text file.
    2. A FilterFunction, whose retained tuples (those that pass the filter) are written to a CSV file.

    You can also have multiple data sources, and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.
