Prepare my bigdata with Spark via Python

独厮守ぢ 2020-12-21 21:08

My data is 100m in size and quantized:

(1424411938, [3885, 7898])
(3333333333, [3885, 7898])

Desired result:

(3885, [33333333         


        
1 Answer
  • 2020-12-21 21:49

    You can use a bunch of basic pyspark transformations to achieve this.

    >>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
    >>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
    

    We used flatMap to emit a (key, value) pair for every item in x[1], changing each record's format to (a, x[0]), where a is each item in x[1]. To understand flatMap better, you can look at the documentation.
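To make the flatMap step concrete without a Spark cluster, here is a plain-Python sketch of the same transformation. It only imitates what flatMap does (map each record to an iterable, then flatten); the helper name `explode` is hypothetical and is not part of PySpark.

```python
from itertools import chain

# Same two sample records as in the answer.
records = [(1424411938, [3885, 7898]), (3333333333, [3885, 7898])]

def explode(record):
    """Yield one (item, key) pair for every item in the record's list."""
    key, items = record
    return ((item, key) for item in items)

# flatMap is "map, then flatten": each record becomes several pairs,
# and all of those pairs end up in one flat sequence.
pairs = list(chain.from_iterable(explode(r) for r in records))
print(pairs)
# [(3885, 1424411938), (7898, 1424411938), (3885, 3333333333), (7898, 3333333333)]
```

A plain map would instead return one generator per record, which is why flatMap is the right choice here.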

    >>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
    

    We then grouped all (key, value) pairs by key and used the tuple function to convert each group's iterable of values into a tuple.

    >>> r2.collect()
    [(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
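The grouping step can also be imitated in plain Python, which may help when reasoning about the result on a small sample. This is a sketch of the idea only, not how Spark implements groupByKey; `group_by_key` is a hypothetical helper, and the input is the list of (item, key) pairs that the flatMap step yields for the two sample records.

```python
from collections import defaultdict

# (item, key) pairs for the two sample records.
pairs = [
    (3885, 1424411938), (7898, 1424411938),
    (3885, 3333333333), (7898, 3333333333),
]

def group_by_key(pairs):
    """Collect all values that share a key, then freeze each group as a tuple."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, tuple(values)) for key, values in groups.items()]

grouped = group_by_key(pairs)
print(grouped)
# [(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
```

Unlike this single-process sketch, Spark shuffles the pairs across partitions, so the order of values inside each group should not be relied on.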
    

    As you said, you can use [:150] to keep only the first 150 elements. I guess this would be the proper usage:

    >>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
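One detail of the slicing approach: tuple(x[1])[:150] first builds a tuple of every value in the group and only then slices it. A possible alternative, sketched here in plain Python, is itertools.islice, which stops after the first n items. The helper name `first_n` is hypothetical, and whether this matters depends on how large the groups are.

```python
from itertools import islice

def first_n(values, n=150):
    """Return at most the first n items of an iterable as a tuple,
    without building a tuple of all the items first."""
    return tuple(islice(values, n))

# A stand-in for one large group of values.
big_group = iter(range(1000))
head = first_n(big_group)
print(len(head), head[:3])
# 150 (0, 1, 2)

# Groups shorter than n are returned whole.
print(first_n([1424411938, 3333333333]))
# (1424411938, 3333333333)
```

In the pipeline this would presumably be used as r.groupByKey().mapValues(first_n); that line is an untested assumption about how it fits, not something from the original answer.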

    I tried to be as explanatory as possible. I hope this helps.
