Scalding: How to retain the other field, after a groupBy('field){.size}?

二次信任 提交于 2019-11-30 05:07:18

问题


So my input data has two fields/columns: id1 & id2, and my code is the following:

TextLine(args("input"))
.read
.mapTo('line->('id1,'id2)) {line: String =>
    val fields = line.split("\t")
        (fields(0),fields(1))
}
.groupBy('id2){.size}
.write(Tsv(args("output")))

The output results in (what i assume) two fields: id2 * size. I'm a little stuck on finding out if it is possible to retain the id1 value that was also grouped with id2 and add it as another field?


回答1:


You can't do this in a nice way I'm afraid. Think about how it works under the hood - it splits the data to be counted into chunks and sends it off to different processes, each process counts it's chunk, then a single reducer adds them all up at the end. While each process is counting it doesn't know the entire size so it can't add the field on. The only way is to go back and add it to the data once the entire size is known (i.e. a join).

If each group fits in memory (and you can configure the memory), you can:

Tsv(args("input"), ('id1, 'id2))
.groupBy('id2)(_.size.toList[(String, String)](('id1, 'id2) -> 'list))
.flatMapTo[(Iterable[(String, String)], Int), (String, String, Int)](('list, 'size) -> ('id1, 'id2, 'size)) {
  case (list, size) => list.map(record => (record._1, record._2, size))
}
.write(Tsv(args("output")))

But if your system doesn't have enough memory, you will have to use an expensive join.

Remark: You can use Tsv instead of TextLine followed by mapTo and splitting.



来源:https://stackoverflow.com/questions/17507526/scalding-how-to-retain-the-other-field-after-a-groupbyfield-size

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!