Write to multiple outputs by key in Scalding/Hadoop, one MapReduce job


Question


How can you write to multiple outputs, depending on the key, using Scalding (/Cascading) in a single MapReduce job? I could of course use .filter for all the possible keys, but that is a horrible hack which will fire up many jobs.


Answer 1:


There is TemplatedTsv in Scalding (from version 0.9.0rc16 and up), which works just like Cascading's TemplateTap.

Tsv(args("input"), ('COUNTRY, 'GDP))
.read
.write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
// it will create a directory for each country under "output" path in Hadoop mode.
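For completeness, here is the same idea as a full, self-contained job; the class name and the "country=%s" template are illustrative rather than taken from the original answer:

import com.twitter.scalding._

// Hypothetical job: reads a TSV of (COUNTRY, GDP) rows and writes one
// sub-directory per country, e.g. <output>/country=US/, <output>/country=FR/, ...
class SplitByCountryJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('COUNTRY, 'GDP))
    .read
    .write(TemplatedTsv(args("output"), "country=%s", 'COUNTRY))
}

It should run like any other Scalding job, e.g. hadoop jar your-assembly.jar com.twitter.scalding.Tool SplitByCountryJob --hdfs --input <in> --output <out> (jar name and paths are placeholders).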



Answer 2:


Use MultipleOutputFormat and extrapolate from these other SO questions to write a custom output class that uses that output format: "Create Scalding Source like TextLine that combines multiple files into single mappers" and "Compress Output Scalding / Cascading TsvCompressed". A sketch of the output-format piece follows.
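To make this a bit more concrete, here is a minimal sketch of the Hadoop-side piece only, assuming the old mapred API's MultipleTextOutputFormat; the class name is hypothetical, and the custom Scalding Source wiring described in the linked questions is omitted:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each record into a sub-directory named after its key, producing
// paths like <output>/US/part-00000, <output>/FR/part-00000, and so on.
class KeyedTextOutputFormat extends MultipleTextOutputFormat[AnyRef, AnyRef] {
  override protected def generateFileNameForKeyValue(
      key: AnyRef, value: AnyRef, name: String): String =
    key.toString + "/" + name
}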




Answer 3:


This post on the Cascading user group suggests using Cascading's TemplateTap. I'm not sure how to connect it to Scalding, though.
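For reference, a minimal sketch of what the raw Cascading side might look like, assuming Cascading 2.x (field names and paths are illustrative); Scalding's TemplatedTsv from answer 1 appears to wrap this same tap:

import cascading.scheme.hadoop.TextDelimited
import cascading.tap.hadoop.{ Hfs, TemplateTap }
import cascading.tuple.Fields

object TemplateTapSketch {
  // The parent Hfs tap supplies the scheme and base path; TemplateTap expands
  // the "%s" template with the value of COUNTRY to pick the sub-directory.
  val parent = new Hfs(new TextDelimited(new Fields("COUNTRY", "GDP")), "output")
  val sink = new TemplateTap(parent, "%s", new Fields("COUNTRY"))
}

To plug this into Scalding you would wrap the tap in a custom Source, which is essentially what TemplatedTsv already does for you.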



Source: https://stackoverflow.com/questions/23994383/write-to-multiple-outputs-by-key-scalding-hadoop-one-mapreduce-job
