sortPartition method of a dataset sorts the dataset locally based on some specified fields. How can I get my large Dataset sorted globally in an efficient way in Flink?
This is currently not easily possible because Flink does not provide a built-in range partitioning strategy, yet.
A work-around is to implement a custom Partitioner
:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionCustom(new Partitioner<Long>() {
int partition(Long key, int numPartitions) {
// your implementation
}
}, 0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
Note: In order to achieve balanced partitions with a custom partitioner, you need to know about the value range and distribution of the key.
Support for a range partitioner (with automatic sampling) in Apache Flink is currently work in progress and should be available soon.
Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionByRange(0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
Note that range partitioning samples the input data set to compute a data distribution for equally-sized partitions.