Random generation of unique combinations from two columns

Question

I have two columns in a large file, say

pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig1
.....

The second column is redundant (its values repeat). I want new random combinations, twice as many as the input, none of which matches a given combination, for example

pro1 lig2
pro1 lig4
pro2 lig1
pro2 lig3
pro3 lig4
pro3 lig2
pro4 lig2
pro4 lig3
.....

Thanks.

Answer 1:

If you want exactly two results for each value in column one, I'd brute force the non-matching part, with something like this:

import random

def gen_random_data(inputfile):
    with open(inputfile, "r") as f:
        column_a, column_b = zip(*(line.strip().split() for line in f))

    for a, b in zip(column_a, column_b):
        r = random.sample(column_b, 2)
        while b in r: # resample if we hit a duplicate of the original pair
            r = random.sample(column_b, 2)

        yield a, r[0]
        yield a, r[1]
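
One caveat with the resampling loop above: because the second column contains repeats, sampling by position can return the same value twice, and a key that appears on several lines is only checked against the partner on the current line. Below is a minimal sketch of a variant that samples from the unique second-column values and excludes every original partner of a key. The function name `gen_new_pairs` and the parameter `per_key` are made up for illustration, and it takes already-parsed pairs rather than a file name:

```python
import random
from collections import defaultdict

def gen_new_pairs(pairs, per_key=2, rng=random):
    """Yield per_key new (a, b) pairs for each distinct a, none of them in pairs."""
    partners = defaultdict(set)          # a -> every b it is paired with in the input
    for a, b in pairs:
        partners[a].add(b)
    unique_b = sorted({b for _, b in pairs})

    for a in partners:                   # dicts keep first-seen order
        candidates = [b for b in unique_b if b not in partners[a]]
        # raises ValueError if fewer than per_key candidates are left
        for b in rng.sample(candidates, per_key):
            yield a, b

# Sample data taken from the question (note that lig1 appears twice).
pairs = [("pro1", "lig1"), ("pro2", "lig2"), ("pro3", "lig3"), ("pro4", "lig1")]
new_pairs = list(gen_new_pairs(pairs))
for a, b in new_pairs:
    print(a, b)
```

Filtering the candidates first means no retry loop is needed, at the cost of building one small list per key.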

Answer 2:

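import random

c = """pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig4"""
original = {}   # left value -> right value it is paired with in the input
set_b = set()
for line in c.split("\n"):
    left, right = line.split(" ")
    original[left] = right
    set_b.add(right)

for left in sorted(original):
    # random.sample needs a sequence, not a set, on Python 3.11+;
    # the original partner is removed so the given pair is never reproduced
    candidates = sorted(set_b - {original[left]})
    for right in random.sample(candidates, 2):
        print(left, right)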

Example output (varies between runs)

pro1 lig2
pro1 lig4
pro2 lig4
pro2 lig3
pro3 lig1
pro3 lig4
pro4 lig2
pro4 lig1

Answer 3:

Using some sorting, filtering, chaining and list comprehensions, you can try:

from itertools import chain
import random
random.seed(12345)  # Only to make the output reproducible; remove in production code

words = [x.split() for x in """pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig4""".split("\n")]

col1 = [w1 for w1,w2 in words]
col2 = [w2 for w1,w2 in words]

col1tocol2 = dict(words)        

combinations = chain(*[
                    [(w1, w2) for w2 in 
                        sorted(
                            filter(
                                lambda x: x != col1tocol2[w1], 
                                col2),
                            key=lambda x: random.random())
                            [:2]]
                    for w1 in col1])

for w1, w2 in combinations:
    print(w1, w2)

This gives:

pro1 lig3
pro1 lig2
pro2 lig4
pro2 lig1
pro3 lig4
pro3 lig2
pro4 lig3
pro4 lig1

The main trick is to use a random function as the key for sorted: this shuffles the filtered candidates, and the [:2] slice then keeps two of them.
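
To illustrate the trick in isolation: sorting by a random key is effectively a shuffle, so taking the first two elements picks two distinct positions at random. The small sketch below also shows `random.sample`, the standard-library call that does the same job directly. The variable names are illustrative only:

```python
import random

candidates = ["lig2", "lig3", "lig4"]   # col2 with pro1's own partner removed

# Sort by a fresh random number per element, then keep the first two.
picked_by_sort = sorted(candidates, key=lambda _: random.random())[:2]

# Same idea expressed directly: two distinct elements chosen at random.
picked_by_sample = random.sample(candidates, 2)

print(picked_by_sort, picked_by_sample)
```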

Answer 4:

Say you have two columns:

col1 = ['pro1', 'pro2', ...]
col2 = ['lig1', 'lig2', ...]

Then the most straightforward way to do this would be to use itertools.product and random.sample as below:

from itertools import product
from random import sample

N = 100 #How many pairs to generate

randomPairs = sample(list(product(col1, col2)), N)

If col1 and col2 contain duplicate items, you can extract the unique items by doing set(col1) and set(col2).

Note that list(product(...)) builds a list with one element per possible pair, that is len(col1) * len(col2) elements (or the product of the unique counts if you deduplicate first). This may exhaust memory if that product is very large.
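
If the full product is too large to hold in memory, one possible alternative is rejection sampling: draw one item from each column at random and keep the pair only if it is new and not excluded. This is a sketch under assumptions rather than part of the answer above; the function name `sample_pairs` and its `exclude` parameter are hypothetical, with `exclude` added so the original pairs can be skipped as the question asks:

```python
import random

def sample_pairs(col1, col2, n, exclude=()):
    """Return n distinct (a, b) pairs drawn at random, skipping those in exclude."""
    items1 = sorted(set(col1))
    items2 = sorted(set(col2))
    set1, set2 = set(items1), set(items2)
    excluded = {(a, b) for a, b in exclude if a in set1 and b in set2}

    # Guard against an endless loop when too many pairs are requested.
    available = len(items1) * len(items2) - len(excluded)
    if n > available:
        raise ValueError("asked for more pairs than exist")

    chosen = set()
    while len(chosen) < n:
        pair = (random.choice(items1), random.choice(items2))
        if pair not in excluded:
            chosen.add(pair)      # a set ignores repeats of the same pair
    return list(chosen)

col1 = ["pro1", "pro2", "pro3", "pro4"]
col2 = ["lig1", "lig2", "lig3", "lig1"]
result = sample_pairs(col1, col2, 8, exclude=zip(col1, col2))
```

Rejection sampling works well when n is small relative to the number of possible pairs. When n approaches that number, the loop spends most of its time redrawing, and building the full list is the simpler choice.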

Source: https://stackoverflow.com/questions/14826378/random-generation-of-unique-combination-from-two-column

Tags

python

Linux

excel

shell

random-sample