How to bucket outputs in Scalding

萌比男神i 2021-01-24 12:57

I'm trying to write a pipe to different directories so that the output in each directory is bucketed by some ids. In plain MapReduce code I would use the

1 answer
  •  清酒与你
    2021-01-24 13:35

    Yes, of course there is a better way using TemplatedTsv.

    Your code above can be written as follows:

    val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
        .read
        .write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
    

    This writes the records for each distinct value of 'some_id into its own folder under out, that is, one folder per id.

    You can also create integer buckets by changing the last lines:

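    // hashCode can be negative, so normalize the remainder into [0, numberOfBuckets)
    .map('some_id -> 'bucket) { id: String => ((id.hashCode % numberOfBuckets) + numberOfBuckets) % numberOfBuckets }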
    .write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))
    

    This will create two-digit folders of the form out/dd/. You can also check the TemplatedTsv API here.
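
    The bucket arithmetic can be sketched in plain Scala, without any Scalding dependency. The names BucketSketch, bucketOf and folderFor are hypothetical and only for illustration, and the sketch assumes the "%02d" template is applied the same way as String formatting. It normalizes the remainder because hashCode can be negative, and % keeps the sign of the dividend.

```scala
object BucketSketch {
  // Hypothetical helper: maps an id to a bucket in [0, numberOfBuckets).
  // hashCode may be negative and % keeps the sign of the dividend,
  // so a negative remainder is shifted back into the valid range.
  def bucketOf(id: String, numberOfBuckets: Int): Int = {
    val r = id.hashCode % numberOfBuckets
    if (r < 0) r + numberOfBuckets else r
  }

  // Hypothetical helper: the folder name a "%02d" template would give a bucket.
  def folderFor(bucket: Int): String = "%02d".format(bucket)

  def main(args: Array[String]): Unit = {
    val ids = Seq("user-1", "user-2", "user-3")
    ids.foreach { id =>
      val b = bucketOf(id, 16)
      println(s"$id -> out/${folderFor(b)}/")
    }
  }
}
```

    With a helper like this, the same id always lands in the same folder, and no folder name starts with a minus sign.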

    One possible problem with TemplatedTsv is that the reducers can generate many small files, which can be bad for the next job that consumes your results. It is therefore better to sort on the template fields before writing to disk. I wrote a blog post about it here.
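
    To see why grouping on the template field helps, here is a toy model in plain Scala. It is not Scalding or Cascading code and makes a simplifying assumption: every writing task emits one file per distinct bucket it sees. SmallFilesSketch and its methods are hypothetical names.

```scala
object SmallFilesSketch {
  // Toy assumption: each task writes one file per distinct bucket it receives.
  def fileCount(tasks: Seq[Seq[Int]]): Int = tasks.map(_.distinct.size).sum

  // Records dealt to tasks by position, ignoring their bucket.
  def roundRobin(buckets: Seq[Int], numberOfTasks: Int): Seq[Seq[Int]] =
    (0 until numberOfTasks).map { t =>
      buckets.zipWithIndex.collect { case (b, i) if i % numberOfTasks == t => b }
    }

  // Records routed by bucket, so each bucket is handled by exactly one task.
  def groupedByBucket(buckets: Seq[Int], numberOfTasks: Int): Seq[Seq[Int]] =
    (0 until numberOfTasks).map { t => buckets.filter(_ % numberOfTasks == t) }

  def main(args: Array[String]): Unit = {
    val buckets = (0 until 24).map(_ % 4) // 24 records spread over 4 buckets
    println(s"unsorted: ${fileCount(roundRobin(buckets, 3))} files")
    println(s"grouped:  ${fileCount(groupedByBucket(buckets, 3))} files")
  }
}
```

    In this model the ungrouped layout produces up to tasks × buckets files, while routing by bucket produces one file per bucket.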
