I have a bunch of tuples in the form of composite keys and values. For example:

tfile.collect() = [(('id1','pd1','t1'),5.0),
(('id2','p
My guess is that you want to transpose the data according to multiple fields.
A simple way is to concatenate the target fields you want to group by and use the result as the key in a pair RDD. For example:
lines = sc.parallelize(['id1,pd1,t1,5.0', 'id2,pd2,t2,6.0', 'id1,pd1,t2,7.5', 'id1,pd1,t3,8.1'])
rdd = lines.map(lambda x: x.split(',')) \
           .map(lambda x: (x[0] + ', ' + x[1], x[3])) \
           .reduceByKey(lambda a, b: a + ', ' + b)
print(rdd.collect())
Then you will get the transposed result.
[('id1, pd1', '5.0, 7.5, 8.1'), ('id2, pd2', '6.0')]
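If you want to check the grouping logic without a Spark context, the same transposition can be sketched in plain Python, with a dict playing the role of reduceByKey:

```python
# Pure-Python sketch of the concatenated-key grouping above;
# a dict stands in for reduceByKey on a pair RDD.
lines = ['id1,pd1,t1,5.0', 'id2,pd2,t2,6.0', 'id1,pd1,t2,7.5', 'id1,pd1,t3,8.1']

grouped = {}
for line in lines:
    fields = line.split(',')
    key = fields[0] + ', ' + fields[1]          # concatenated composite key
    value = fields[3]
    if key in grouped:
        # same merge as the reduce lambda: a + ', ' + b
        grouped[key] = grouped[key] + ', ' + value
    else:
        grouped[key] = value

print(sorted(grouped.items()))
# [('id1, pd1', '5.0, 7.5, 8.1'), ('id2, pd2', '6.0')]
```

This mirrors the RDD result exactly; the only difference is that Spark performs the reduction in parallel across partitions.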
In my map function I grouped the records as ((id1,t1), [(p1,5.0),(p2,6.0)]) and so on. I then reduce by key, and map_group creates an array for [p1, p2, ...] and fills in each value at its respective position.
import numpy as np

def map_group(pgroup):
    x = np.zeros(19)
    x[0] = 1
    value_list = pgroup[1]
    for val in value_list:
        fno = val[0].split('.')[0]   # feature number parsed from the field name
        x[int(fno) - 5] = val[1]     # shift so the value lands in its slot
    return x
tgbr = tfile.map(lambda d: ((d[0][0],d[0][2]),[(d[0][1],d[1])])) \
.reduceByKey(lambda p,q:p+q) \
.map(lambda d: (d[0], map_group(d)))
This feels computationally expensive, but it works for now.
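Without a Spark context the same pipeline can be exercised in plain Python: a dict stands in for reduceByKey, and a list stands in for the NumPy array. The '6.f'-style field names below are made up purely so the split('.') parsing has something to work on:

```python
# Pure-Python sketch of the grouping + map_group step, assuming
# hypothetical field names like '6.f' whose leading number picks the slot.
def map_group(pgroup):
    x = [0.0] * 19
    x[0] = 1
    for val in pgroup[1]:
        fno = val[0].split('.')[0]   # leading feature number
        x[int(fno) - 5] = val[1]     # same offset as the RDD version
    return x

tfile = [(('id1', '6.f', 't1'), 5.0), (('id1', '7.f', 't1'), 7.5)]

# group the (p, value) pairs under each (id, t) key, as reduceByKey(p + q) does
grouped = {}
for (i, p, t), v in tfile:
    grouped.setdefault((i, t), []).append((p, v))

result = {k: map_group((k, vals)) for k, vals in grouped.items()}
print(result[('id1', 't1')][:3])
# [1, 5.0, 7.5] -- x[0] stays 1; features 6 and 7 land in slots 1 and 2
```

A cheaper variant of the last RDD step would be mapValues(...), which avoids rebuilding the key, but the structure of the computation stays the same.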