Question
I'm trying to write a pipe out to different directories so that the output of each directory is bucketed by some ids. In plain MapReduce code I would use the MultipleOutputs class and do something like this in the reducer,
protected void reduce(final SomeKey key,
                      final Iterable<SomeValue> values,
                      final Context context) {
    ...
    for (SomeValue value : values) {
        String bucketId = computeBucketIdFrom(...);
        // route this record to a bucket-specific subdirectory
        multipleOutputs.write(key, value, folderName + "/" + bucketId);
        ...
So I guess one could do it like this in Scalding,
...
val somePipe = Csv(in, separator = "\t",
                   fields = someSchema,
                   skipHeader = true)
  .read
for (i <- 0 until numberOfBuckets) {
  somePipe
    .filter('someId) { id: String => (math.abs(id.hashCode) % numberOfBuckets) == i }
    .write(Csv(out + "/bucket" + i,
               writeHeader = true,
               separator = "\t"))
}
But I feel that you would end up reading the same pipe many times, which would hurt overall performance.
Are there any alternatives?
Thanks
Answer 1:
Yes, of course there is a better way using TemplatedTsv.
So your code above can be written as follows,
val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
  .read
  .write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
This will partition the records by 'some_id, writing each distinct value of that field into its own subfolder under out.
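For illustration, if 'some_id took the hypothetical values a and b, the output layout would look roughly like this (part file names depend on your cluster configuration),
out/a/part-00000
out/b/part-00000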
However, you can also create integer buckets. Just change the last lines,
.map('some_id -> 'bucket) { id: String => math.abs(id.hashCode) % numberOfBuckets } // abs guards against negative hash codes
.write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))
This will create two-digit folder names such as out/00/, out/01/, and so on. You can also check the TemplatedTsv API here.
There might be a small problem when using TemplatedTsv: the reducers can generate lots of small files, which can be bad for the next job consuming your results. Therefore, it is better to sort on the template fields before writing to disk. I wrote a blog post about it here.
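For example, a minimal sketch of that idea (assuming the same schema as above; the groupBy on 'bucket is an extra shuffle step added for this purpose, not something TemplatedTsv itself requires),
Tsv(in, fields = someSchema, skipHeader = true)
  .read
  .map('some_id -> 'bucket) { id: String => math.abs(id.hashCode) % numberOfBuckets }
  // grouping on the template field forces a shuffle, so each reducer
  // receives whole buckets and opens far fewer output files
  .groupBy('bucket) { _.sortBy('some_id) }
  .write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true))
Since all records for a given bucket now land on the same reducers, each reducer writes to only a handful of template directories instead of scattering small files across all of them.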
Source: https://stackoverflow.com/questions/28357987/how-to-bucket-outputs-in-scalding