Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

时光说笑

I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.

How can I do that?

2 Answers

臣服心动

2021-01-06 13:16

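You can wrap a Hadoop MultipleTextOutputFormat in Flink's HadoopOutputFormat. First, define a subclass that drops the key from the written record and uses it as the file name instead: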

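import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
  // Replace the key with NullWritable so that only the value is written to the file.
  override def generateActualKey(key: K, value: V): K =
    NullWritable.get().asInstanceOf[K]

  // Use the record's key as the name of the output file.
  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.asInstanceOf[String]
}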

IteblogMultipleTextOutputFormat can then be used as follows:

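import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapred.HadoopOutputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

val env = ExecutionEnvironment.getExecutionEnvironment
val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
val jc = new JobConf()
FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)
val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
  ("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
batch.output(format)
env.execute() // sinks are lazy; this triggers the job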
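
To make the effect of the two overrides easier to picture, here is a minimal plain-Java sketch of the routing they describe: the key picks the file, and only the value is written. It is a model only and uses neither Flink nor Hadoop; the class name KeyedFileRouting, the method writeByKey, and the temporary output directory are hypothetical names chosen for illustration, and it assumes the intended result is one file per distinct key.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical plain-Java model; it does not use Flink or Hadoop.
public class KeyedFileRouting {

    // Appends each value to a file named after its key. The key is used only
    // for routing and is not written, mirroring the NullWritable substitution.
    static void writeByKey(Path outputDir, List<Map.Entry<String, String>> records)
            throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, String> record : records) {
            Path target = outputDir.resolve(record.getKey());
            byte[] line = (record.getValue() + "\n").getBytes(StandardCharsets.UTF_8);
            Files.write(target, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("keyed-output");
        writeByKey(dir, List.of(
                Map.entry("A", "1"), Map.entry("A", "2"), Map.entry("A", "3"),
                Map.entry("B", "1"), Map.entry("B", "2"),
                Map.entry("C", "1"), Map.entry("D", "2")));

        // List the generated files in name order with their contents.
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            System.out.println(file.getFileName() + " -> " + Files.readAllLines(file));
        }
    }
}
```

With the same seven records as in the answer, the sketch produces four files named A, B, C and D, each holding only the values for its key.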

For more information, see: http://www.iteblog.com/archives/1667


离开以前

2021-01-06 13:37
You can add as many data sinks to a DataSet program as you need.

For example, in a program like this:
```
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple3<String, Long, Long>> data =
    env.readCsvFile(...).types(String.class, Long.class, Long.class);
// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");
// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");
// sinks are lazy; this triggers the job
env.execute();
```
Here a DataSet named data is read from a CSV file and passed to two separate transformations:
1. A MapFunction, whose result is written to a text file.
2. A FilterFunction, whose output (the tuples that pass the filter) is written to a CSV file.
You can also have multiple data sources, and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.