My quantized data, 100m in size:
(1424411938, [3885, 7898])
(3333333333, [3885, 7898])
Desired result:
(3885, [1424411938, 3333333333])
(7898, [1424411938, 3333333333])
You can achieve this with a few basic PySpark transformations.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to get a (key, value) pair for every item in x[1], changing each record to the format (a, x[0]), where a is each item in x[1]. To understand flatMap better, you can look at the documentation.
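For illustration, collecting r at this stage shows one pair per list item; assuming the element order of the sample input above is preserved, the output would be:
>>> r.collect()
[(3885, 1424411938), (7898, 1424411938), (3885, 3333333333), (7898, 3333333333)]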
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We then grouped all (key, value) pairs by key and used the tuple function to convert each resulting iterable to a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
As you said, you can use [:150] to keep the first 150 elements; this would be the proper usage:
>>> r2 = r.groupByKey().map(lambda x: (x[0], tuple(x[1])[:150]))
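One caveat for data at your scale: groupByKey materializes every value for a key before [:150] slices the result. As a sketch of an alternative (assuming a cap of 150 values per key is what you want; untested against your data), aggregateByKey can enforce the cap while aggregating:
>>> # zero value: an empty list per key; within a partition, append values
>>> # until the list holds 150; across partitions, concatenate and re-cap.
>>> r2 = r.aggregateByKey(
...     [],
...     lambda acc, v: acc + [v] if len(acc) < 150 else acc,
...     lambda a, b: (a + b)[:150]
... ).map(lambda x: (x[0], tuple(x[1])))
This keeps the per-key state bounded during aggregation instead of collecting the whole group first and slicing afterwards.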
I tried to be as explanatory as possible. I hope this helps.