How to bucket outputs in Scalding


萌比男神i 2021-01-24 12:57

I'm trying to output a pipe into different directories such that the output of each directory will be bucketed based on some ids. So in a plain map reduce code I would use the

1 answer

清酒与你 (original poster)

2021-01-24 13:35
Yes, of course there is a better way using TemplatedTsv.

Your code above can be written as follows:
```
val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
    .read
    .write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
```
This writes the records for each distinct value of 'some_id into a separate folder under out/, named after that value.

You can also create integer buckets instead. Just change the last lines:
```
// hashCode can be negative, so take the absolute value of the remainder
.map('some_id -> 'bucket) { id: String => math.abs(id.hashCode % numberOfBuckets) }
.write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))
```
This creates two-digit folders of the form out/dd/. See the TemplatedTsv API documentation for more details.
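One detail worth checking in the bucketing step: String.hashCode can be negative, and Scala's % keeps the sign of the dividend, so a raw remainder can fall outside the expected range. Below is a minimal sketch in plain Scala (no Scalding required) of how an id could map to a two-digit folder name. The helper names bucketOf and folderOf are hypothetical, chosen only for illustration.

```scala
object BucketSketch {
  // Hypothetical helper: maps an id to a bucket index in [0, numberOfBuckets).
  // The remainder is shifted up when negative, because hashCode may be negative.
  def bucketOf(id: String, numberOfBuckets: Int): Int = {
    val m = id.hashCode % numberOfBuckets
    if (m < 0) m + numberOfBuckets else m
  }

  // Formats the bucket the same way a "%02d" template would.
  def folderOf(id: String, numberOfBuckets: Int): String =
    "%02d".format(bucketOf(id, numberOfBuckets))

  def main(args: Array[String]): Unit = {
    val ids = Seq("user-1", "user-2", "user-3")
    ids.foreach(id => println(s"$id -> out/${folderOf(id, 16)}/"))
  }
}
```

Note that "%02d" only pads to two digits, so with more than 100 buckets the folder names would no longer have a uniform width.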

There is one potential problem with TemplatedTsv: reducers can generate lots of small files, which can be bad for the next job that consumes your results. Therefore, it is better to sort on the template fields before writing to disk. I wrote a blog post about this.
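To see why grouping on the template field helps, here is a toy model in plain Scala. It is not Scalding or Hadoop code, and it rests on a simplifying assumption: every (reducer, bucket) pair that receives at least one record produces one output file. The object and method names are hypothetical.

```scala
object SmallFilesSketch {
  // Toy assumption: one output file per distinct (reducer, bucket) pair.
  def fileCount(assignments: Seq[(Int, Int)]): Int = assignments.distinct.size

  // Records are spread over reducers independently of their bucket,
  // so each bucket ends up on several reducers.
  def ungrouped(records: Int, reducers: Int, buckets: Int): Seq[(Int, Int)] =
    (0 until records).map(i => ((i / buckets) % reducers, i % buckets))

  // Records are grouped on the bucket first, so all records of a bucket
  // reach the same reducer.
  def grouped(records: Int, reducers: Int, buckets: Int): Seq[(Int, Int)] =
    (0 until records).map { i =>
      val bucket = i % buckets
      (bucket % reducers, bucket)
    }

  def main(args: Array[String]): Unit = {
    println(s"ungrouped files: ${fileCount(ungrouped(1000, 4, 8))}")
    println(s"grouped files:   ${fileCount(grouped(1000, 4, 8))}")
  }
}
```

In this model the ungrouped case approaches reducers × buckets files, while the grouped case stays at one file per bucket, which is the effect the advice above is aiming for.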