How to group by multiple keys in spark?

既然无缘 2021-01-02 19:00

I have a bunch of tuples which are in the form of composite keys and values. For example,

tfile.collect() = [(('id1','pd1','t1'),5.0),
     (('id2','p


        
2 Answers
  • 2021-01-02 19:36

    My guess is that you want to group the data by multiple fields.

    A simple way is to concatenate the fields that you want to group by and make the result the key of a pair RDD. For example:

    lines = sc.parallelize(['id1,pd1,t1,5.0', 'id2,pd2,t2,6.0', 'id1,pd1,t2,7.5', 'id1,pd1,t3,8.1'])
    # Build an "id, pd" composite key, then fold every value for that key into one string.
    rdd = lines.map(lambda x: x.split(',')) \
               .map(lambda x: (x[0] + ', ' + x[1], x[3])) \
               .reduceByKey(lambda a, b: a + ', ' + b)
    print(rdd.collect())
    

    Then you will get the grouped result:

    [('id1, pd1', '5.0, 7.5, 8.1'), ('id2, pd2', '6.0')]
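
    If you would rather keep the grouped values as numbers instead of one concatenated string, a small variation of the same idea is to use the (id, pd) tuple itself as the key and collect the values into a list. This is only a sketch, assuming the same sample input as above:

    lines = sc.parallelize(['id1,pd1,t1,5.0', 'id2,pd2,t2,6.0', 'id1,pd1,t2,7.5', 'id1,pd1,t3,8.1'])
    # Key by the (id, pd) tuple directly and keep the values as floats.
    pairs = lines.map(lambda x: x.split(',')).map(lambda x: ((x[0], x[1]), float(x[3])))
    grouped = pairs.groupByKey().mapValues(list)
    print(grouped.collect())
    # e.g. [(('id1', 'pd1'), [5.0, 7.5, 8.1]), (('id2', 'pd2'), [6.0])]

    Note that groupByKey shuffles every value across the network; if you only need an aggregate per key, reduceByKey or aggregateByKey is usually cheaper.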
    
  • 2021-01-02 19:45

    In my map function I group the data as ((id1, t1), [(p1, 5.0), (p2, 6.0), ...]) and so on. Later, I reduce using map_group, which creates an array for [p1, p2, ...] and fills in the values at their respective positions.

    import numpy as np

    def map_group(pgroup):
        # pgroup is a (key, value_list) pair coming out of reduceByKey below.
        x = np.zeros(19)
        x[0] = 1
        value_list = pgroup[1]
        for val in value_list:
            # val is (field_name, value); the part of the field name before the
            # first '.' is taken as an integer slot index, offset by 5.
            fno = val[0].split('.')[0]
            x[int(fno) - 5] = val[1]
        return x

    tgbr = tfile.map(lambda d: ((d[0][0], d[0][2]), [(d[0][1], d[1])])) \
                .reduceByKey(lambda p, q: p + q) \
                .map(lambda d: (d[0], map_group(d)))
    

    This does feel like an expensive solution in terms of computation, but it works for now.
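
    If the list concatenation in reduceByKey becomes a bottleneck, one possible variation is to fold each (field, value) pair directly into the NumPy array with aggregateByKey, so no intermediate Python lists are built. This is only a sketch under the same assumptions as above (19 slots, the field-name prefix before the first '.' is an integer slot index offset by 5, values are non-negative, and each field appears at most once per key); slot_of and fill are illustrative names, not part of the original code.

    import numpy as np

    def slot_of(field_name):
        # Same convention as map_group: integer prefix before the first '.', offset by 5.
        return int(field_name.split('.')[0]) - 5

    def fill(acc, kv):
        # Fold one (field_name, value) pair into the per-key array.
        field_name, value = kv
        out = acc.copy()          # never mutate the shared zero array
        out[0] = 1
        out[slot_of(field_name)] = value
        return out

    tgbr = tfile.map(lambda d: ((d[0][0], d[0][2]), (d[0][1], d[1]))) \
                .aggregateByKey(np.zeros(19), fill,
                                lambda a, b: np.maximum(a, b))  # merge partial arrays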
