Question
So my rdd consists of data looking like:
(k, [v1,v2,v3...])
I want to create all two-element combinations of the value part.
So the end map should look like:
(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))
I know that to get the value part, I would use something like
rdd.cartesian(rdd).filter(lambda pair: pair[0] < pair[1])
However, that requires the entire RDD to be passed in (right?), not just the value part. I am unsure how to arrive at my desired end; I suspect it's a groupBy.
Also, ultimately, I want to get to a k,v pair looking like
((k1,v1,v2),1)
I know how to get from what I am looking for to that, but maybe it's easier to go straight there?
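For reference, a minimal sketch of going straight there, assuming rdd holds the (k, [v1, v2, ...]) records above:

import itertools

# Sketch: emit ((k, v1, v2), 1) directly, one record per
# unordered pair drawn from each key's value list.
pairs = rdd.flatMap(
    lambda kv: [((kv[0],) + pair, 1)
                for pair in itertools.combinations(kv[1], 2)])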
Thanks.
Answer 1:
I think Israel's answer is incomplete, so I went a step further.
import itertools

a = sc.parallelize([
    (1, [1, 2, 3, 4]),
    (2, [3, 4, 5, 6]),
    (3, [-1, 2, 3, 4])
])

def combinations(row):
    # Pair the key with every 2-element combination of its value list.
    l = row[1]
    k = row[0]
    return [(k, v) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]
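As a side note, the map followed by flatMap(lambda x: x) can presumably be collapsed into a single a.flatMap(combinations), which yields the same flattened (k, (v1, v2)) records in one pass.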
Answer 2:
Use itertools to create the combinations. Here is a demo:
import itertools

k, v1, v2, v3 = 'k1 v1 v2 v3'.split()
a = (k, [v1, v2, v3])
# All unordered pairs from the value list, each paired back with the key.
b = itertools.combinations(a[1], 2)
data = [(k, pair) for pair in b]
data
will be:
[('k1', ('v1', 'v2')), ('k1', ('v1', 'v3')), ('k1', ('v2', 'v3'))]
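The same idea lifts onto the question's RDD with a flatMap; a minimal sketch, assuming rdd holds (k, [v1, v2, ...]) records:

rdd.flatMap(lambda kv: [(kv[0], pair)
                        for pair in itertools.combinations(kv[1], 2)])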
Answer 3:
I have made this algorithm, but with higher numbers it looks like it doesn't work or is very slow. It will run on a big data cluster (Cloudera), so I think I have to put the function into PySpark; please give a hand if you can.
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    # Cartesian product of all the countdown ranges.
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
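As a hedged sketch of what the PySpark port might look like, assuming a SparkContext sc is available (the variable names here are hypothetical), the in-memory product can be replaced by distributed cartesian joins:

# Build each countdown range as its own RDD, then take their cartesian
# product on the cluster instead of materializing it in local memory.
r0 = sc.parallelize(range(number_list[0], -1, -1))
r1 = sc.parallelize(range(number_list[1], -1, -1))
r2 = sc.parallelize(range(number_list[2], -1, -1))
triples = (r0.cartesian(r1)      # (a, b) pairs
             .cartesian(r2)      # ((a, b), c)
             .map(lambda t: (t[0][0], t[0][1], t[1])))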
Source: https://stackoverflow.com/questions/39026480/creating-combination-of-value-list-with-existing-key-pyspark