I have an RDD with the following rows:
[(id,value)]
How would you sum the values of all rows in the RDD?
Simply use sum(); you just need to flatten the values into a single RDD of numbers first.
For example:
# A comment after a backslash continuation is a syntax error, so wrap the chain in parentheses.
(sc.parallelize([('id', [1, 2, 3]), ('id2', [3, 4, 5])])
   .flatMap(lambda tup: tup[1])  # [1, 2, 3, 3, 4, 5]
   .sum())
Outputs 18
Similarly, if each value is already a single number, just use values() to get that second column as an RDD on its own:
(sc.parallelize([('id', 6), ('id2', 12)])
   .values()  # [6, 12]
   .sum())
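
If you don't have a Spark session handy, here is a rough plain-Python analogue of what the two pipelines above compute. This is only a local sketch and not Spark code; the variable names (`rows_with_lists`, `rows_with_scalars`) are made up for illustration.

```python
from itertools import chain

# Hypothetical sample data mirroring the two RDDs above.
rows_with_lists = [('id', [1, 2, 3]), ('id2', [3, 4, 5])]
rows_with_scalars = [('id', 6), ('id2', 12)]

# Analogue of flatMap(lambda tup: tup[1]): flatten every value list into one sequence.
flattened = list(chain.from_iterable(values for _, values in rows_with_lists))
total_from_lists = sum(flattened)

# Analogue of values(): keep only the second element of each (id, value) pair.
second_column = [value for _, value in rows_with_scalars]
total_from_scalars = sum(second_column)

print(flattened, total_from_lists)
print(second_column, total_from_scalars)
```

The point is just that flatMap is needed when each value is itself a list, while values() is enough when each value is a single number.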