Scalding: How to retain the other field, after a groupBy('field){.size}?

陌清茗 2021-01-05 16:02

So my input data has two fields/columns: id1 & id2, and my code is the following:

TextLine(args("input"))
.read
.mapTo('line->('id1,'id2)) {line:


        
1 answer
  • 2021-01-05 16:42

    You can't do this in a nice way, I'm afraid. Think about how it works under the hood: the data to be counted is split into chunks and sent to different processes, each process counts its own chunk, and a single reducer adds the counts up at the end. While a process is counting, it doesn't know the total size, so it can't attach that field to its rows. The only way is to go back and add the size to the data once the total is known (i.e. a join).
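To make the point concrete, here is a minimal sketch in plain Scala collections (no Scalding involved) that imitates those phases. The object and method names are made up for illustration, and the "chunks" are simply in-memory sequences standing in for the splits handled by separate processes.

```scala
object GroupSizeSketch {
  // One input row: (id1, id2), mirroring the question's two columns.
  type Row = (String, String)

  // Map side: each chunk counts its own rows per id2 and sees only partial totals.
  def partialCounts(chunk: Seq[Row]): Map[String, Long] =
    chunk.groupBy(_._2).map { case (id2, rows) => id2 -> rows.size.toLong }

  // Reduce side: add up the partial counts per id2 to get the full group size.
  def mergeCounts(partials: Seq[Map[String, Long]]): Map[String, Long] =
    partials.flatten.groupBy(_._1).map { case (id2, kvs) => id2 -> kvs.map(_._2).sum }

  // Second pass (the "join"): only now can each row be tagged with its group size.
  def withGroupSize(chunks: Seq[Seq[Row]]): Seq[(String, String, Long)] = {
    val sizes = mergeCounts(chunks.map(partialCounts))
    chunks.flatten.map { case (id1, id2) => (id1, id2, sizes(id2)) }
  }
}
```

With the chunks `Seq(("a","x"), ("b","y"))` and `Seq(("c","x"))`, the first chunk on its own counts only one row for `x`, although the group really has two. The correct size is available only after `mergeCounts`, which is why the rows need a second pass.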

    If each group fits in memory (and you can configure how much memory is available), you can collect each group into a list and emit its rows together with the group size:

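    Tsv(args("input"), ('id1, 'id2))
    // Per 'id2 group: count the rows ('size) and collect the (id1, id2) pairs in memory ('list).
    .groupBy('id2)(_.size.toList[(String, String)](('id1, 'id2) -> 'list))
    // Expand each group back into one output row per original record, tagged with the group size.
    .flatMapTo[(Iterable[(String, String)], Int), (String, String, Int)](('list, 'size) -> ('id1, 'id2, 'size)) {
      case (list, size) => list.map(record => (record._1, record._2, size))
    }
    .write(Tsv(args("output")))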
    

    But if your system doesn't have enough memory, you will have to use an expensive join.
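For reference, the join variant could look roughly like the sketch below. It is an illustration under assumptions, not code from the question: the job name `GroupSizeJoinJob` and the temporary field `'groupKey` are hypothetical, and the input is assumed to be a two-column TSV as in the snippet above.

```scala
import com.twitter.scalding._

// Hypothetical job name; the "input" and "output" arguments follow the code above.
class GroupSizeJoinJob(args: Args) extends Job(args) {
  val rows = Tsv(args("input"), ('id1, 'id2)).read

  // First pass: one (id2, size) row per group. The key is renamed so that
  // the two sides of the join use distinct field names.
  val sizes = rows
    .groupBy('id2) { _.size }
    .rename('id2 -> 'groupKey)

  // Second pass: attach the size to every original row, then drop the extra key.
  rows
    .joinWithSmaller('id2 -> 'groupKey, sizes)
    .discard('groupKey)
    .write(Tsv(args("output")))
}
```

The sizes pipe has one row per distinct id2, so it is passed as the smaller side of the join. The cost comes from reading and shuffling the input a second time, which is what makes this route more expensive than the in-memory version.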

    Remark: You can use Tsv instead of TextLine followed by mapTo and splitting.
