I'm trying to output a pipe into different directories such that the output of each directory will be bucketed based on some ids. So in a plain map reduce code I would use the
Yes, of course there is a better way using TemplatedTsv.
So your code above can be written as follows,
val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
  .read
  .write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
This writes the records for each distinct value of 'some_id into its own folder under out, i.e. out/<some_id>/.
You can also create integer buckets instead. Just change the last lines,
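  // floorMod keeps the bucket non-negative; hashCode % n can be negative, which would yield folders like "-3"
  .map('some_id -> 'bucket) { id: String => java.lang.Math.floorMod(id.hashCode, numberOfBuckets) }
  // ('all except 'bucket) is a placeholder: pass the fields you want written, leaving out 'bucket
  .write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))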
This will create two-digit folders of the form out/dd/. You can also check the TemplatedTsv API here.
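To see what the bucket step produces without running a job, here is a minimal plain-Scala sketch of the same arithmetic. The names BucketSketch, bucketOf and folderOf, the sample ids and the bucket count of 16 are made up for illustration; only the hash-modulo idea and the "%02d" template come from the answer above.

```scala
object BucketSketch {
  // Bucket index in [0, numberOfBuckets). String.hashCode can be negative,
  // so floorMod is used instead of a plain % to keep the result non-negative.
  def bucketOf(id: String, numberOfBuckets: Int): Int =
    java.lang.Math.floorMod(id.hashCode, numberOfBuckets)

  // Folder name that a "%02d" template would produce for this id's bucket.
  def folderOf(id: String, numberOfBuckets: Int): String =
    "%02d".format(bucketOf(id, numberOfBuckets))

  def main(args: Array[String]): Unit = {
    val ids = Seq("user-1", "user-2", "another-id") // hypothetical ids
    ids.foreach(id => println(s"$id -> out/${folderOf(id, 16)}/"))
  }
}
```

The same id always maps to the same folder, and with up to 100 buckets every folder name is exactly two digits.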
There is one potential problem with TemplatedTsv: reducers can generate lots of small files, which can be bad for the next job that uses your results. Therefore, it is better to sort on the template fields before writing to disk. I wrote a blog post about it here.
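The effect on file counts can be illustrated with a small plain-Scala simulation. This is not Scalding code: it assumes each writing task creates one part file per template value it sees, and the task count, bucket count and record layout are hypothetical.

```scala
object SmallFilesSketch {
  val numTasks = 4    // hypothetical number of writing tasks
  val numBuckets = 8  // hypothetical number of template values
  val buckets: Seq[Int] = (0 until 1000).map(_ % numBuckets)

  // Assumption: one part file per distinct (task, bucket) pair.
  def fileCount(assignments: Seq[(Int, Int)]): Int = assignments.distinct.size

  // Without grouping, records of one bucket are spread over all tasks.
  def ungrouped: Seq[(Int, Int)] =
    buckets.zipWithIndex.map { case (b, i) => ((i / numBuckets) % numTasks, b) }

  // Grouped on the template field, each bucket is handled by a single task.
  def grouped: Seq[(Int, Int)] = buckets.map(b => (b % numTasks, b))

  def main(args: Array[String]): Unit = {
    println(s"ungrouped: ${fileCount(ungrouped)} files")
    println(s"grouped:   ${fileCount(grouped)} files")
  }
}
```

In this setup the ungrouped layout yields one file per task and bucket combination, while the grouped layout yields one file per bucket.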