How to bucket outputs in Scalding

我只是一个虾纸丫 提交于 2019-12-02 07:06:13

问题


I'm trying to output a pipe into different directories such that the output of each directory will be bucketed based on some ids. So in a plain map reduce code I would use the MultipleOutputs class and I would do something like this in the reducer.

protected void reduce(final SomeKey key,
      final Iterable<SomeValue> values,
      final Context context) {

   ...
   for (SomeValue value: values) {
     String bucketId = computeBucketIdFrom(...);
     multipleOutputs.write(key, value, folderName + "/" + bucketId);
   ...

So i guess one could do it like this in scalding

...
  val somePipe = Csv(in, separator = "\t",
        fields = someSchema,
        skipHeader = true)
    .read

  for (i <- 1 until numberOfBuckets) {
    somePipe
    .filter('someId) {id: String => (id.hashCode % numberOfBuckets) == i}
    .write(Csv(out + "/bucket" + i ,
      writeHeader = true,
      separator = "\t"))
  }

But I feel that you would end up reding the same pipe many times and it will affect the overall performance.

Is there any other alternatives?

Thanks


回答1:


Yes, of course there is a better way using TemplatedTsv.

So your code above can be written as follows,

val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
    .read
    .write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))

This will put all records coming from 'some_id into separate folders under out/some_ids folder.

However, you can also create integer buckets. Just change the last lines,

.map('some_id -> 'bucket) { id: String => id.hashCode % numberOfBuckets }    
.write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))

This will create two digit folders as out/dd/. You can also check templatedTsv api here.

There might be small problem using templatedTsv, that is reducers can generate lots of small files which can be bad for the next job using your results. Therefore, it is better to sort on template fields before writing to disk. I wrote a blog about about it here.



来源:https://stackoverflow.com/questions/28357987/how-to-bucket-outputs-in-scalding

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!