Question
I have two columns in a large file, say
pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig1
.....
The second column is redundant (its values repeat). I want to generate new random combinations, twice as many as the original rows, none of which match a given combination. For example:
pro1 lig2
pro1 lig4
pro2 lig1
pro2 lig3
pro3 lig4
pro3 lig2
pro4 lig2
pro4 lig3
.....
Thanks.
Answer 1:
If you want exactly two results for each value in column one, I'd brute force the non-matching part, with something like this:
import random

def gen_random_data(inputfile):
    with open(inputfile, "r") as f:
        column_a, column_b = zip(*(line.strip().split() for line in f))
    for a, b in zip(column_a, column_b):
        r = random.sample(column_b, 2)
        while b in r:  # resample if we hit a duplicate of the original pair
            r = random.sample(column_b, 2)
        yield a, r[0]
        yield a, r[1]
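Because the second column contains repeated values, sampling directly from it can return the same ligand twice for one protein. A minimal sketch of a variant that samples from the unique ligands not yet paired with each protein is shown here; the function name `gen_new_pairs` and its parameters are hypothetical, and the input is assumed to be already parsed into (protein, ligand) tuples:

```python
import random
from collections import defaultdict

def gen_new_pairs(pairs, per_protein=2, rng=random):
    """Yield `per_protein` new (protein, ligand) pairs for each protein.

    `pairs` is an iterable of (protein, ligand) tuples. All names here are
    illustrative and not part of the original answer.
    """
    known = defaultdict(set)  # protein -> ligands already paired with it
    for protein, ligand in pairs:
        known[protein].add(ligand)
    # Unique ligands, sorted so the candidate order is deterministic.
    ligands = sorted({lig for ligs in known.values() for lig in ligs})

    for protein in known:  # dicts keep insertion order (Python 3.7+)
        candidates = [lig for lig in ligands if lig not in known[protein]]
        # random.sample raises ValueError if too few candidates remain.
        for ligand in rng.sample(candidates, per_protein):
            yield protein, ligand

pairs = [("pro1", "lig1"), ("pro2", "lig2"), ("pro3", "lig3"), ("pro4", "lig1")]
new_pairs = list(gen_new_pairs(pairs))
```

This avoids the resampling loop, at the cost of building a candidate list per protein.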
Answer 2:
import random

c = """pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig4"""
lines = c.split("\n")
set_a = set()
set_b = set()
for line in lines:
    left, right = line.split(" ")
    set_a.add(left)
    set_b.add(right)
# Note: the original pairing is not excluded here, so it can reappear by chance.
for left in sorted(set_a):
    # random.sample needs a sequence; passing a set fails on Python 3.11+
    rights = random.sample(sorted(set_b), 2)
    for right in rights:
        print(left, right)
Output (one possible run; results are random):
pro1 lig2
pro1 lig4
pro2 lig4
pro2 lig3
pro3 lig1
pro3 lig4
pro4 lig2
pro4 lig1
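The snippet in this answer samples from every right-hand value, so a run can reproduce one of the original pairings even though the output shown here happens not to. A small sketch of a check for that, using a hypothetical helper named `find_collisions` and the data from this answer:

```python
def find_collisions(original_pairs, generated_pairs):
    """Return the generated pairs that also occur in the original data."""
    return sorted(set(original_pairs) & set(generated_pairs))

original = [("pro1", "lig1"), ("pro2", "lig2"), ("pro3", "lig3"), ("pro4", "lig4")]
generated = [
    ("pro1", "lig2"), ("pro1", "lig4"),
    ("pro2", "lig4"), ("pro2", "lig3"),
    ("pro3", "lig1"), ("pro3", "lig4"),
    ("pro4", "lig2"), ("pro4", "lig1"),
]

collisions = find_collisions(original, generated)
print(collisions)  # an empty list means no original pairing was reproduced
```

If the list is not empty, the affected rows can be resampled or the run repeated.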
Answer 3:
Using some sorting, filtering, chaining and list comprehensions, you can try:
from itertools import chain
import random

random.seed(12345)  # Only for fixing output, remove in production code
words = [x.split() for x in """pro1 lig1
pro2 lig2
pro3 lig3
pro4 lig4""".split("\n")]
col1 = [w1 for w1, w2 in words]
col2 = [w2 for w1, w2 in words]
col1tocol2 = dict(words)
combinations = chain(*[
    [(w1, w2) for w2 in
        sorted(
            filter(
                lambda x: x != col1tocol2[w1],
                col2),
            key=lambda x: random.random())
        [:2]]
    for w1 in col1])
for w1, w2 in combinations:
    print(w1, w2)
This gives:
pro1 lig3
pro1 lig2
pro2 lig4
pro2 lig1
pro3 lig4
pro3 lig2
pro4 lig3
pro4 lig1
The main trick is to use a random function as the key for sorted.
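Sorting with a random key is effectively a shuffle, so taking the first two items picks two distinct candidates at random. As a rough illustration (the variable names `candidates`, `via_sorted` and `via_sample` are made up for this sketch), the same selection can be written with random.sample, which skips the full sort:

```python
import random

col2 = ["lig1", "lig2", "lig3", "lig4"]
original = "lig1"  # ligand already paired with the current protein

# Drop the ligand that must not be paired again.
candidates = [lig for lig in col2 if lig != original]

# Sorting with a random key puts the candidates in random order,
# then the slice keeps the first two.
via_sorted = sorted(candidates, key=lambda _: random.random())[:2]

# random.sample picks two distinct positions directly.
via_sample = random.sample(candidates, 2)

print(via_sorted, via_sample)
```

Both forms draw two distinct entries from the filtered list; the random.sample form is usually easier to read.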
Answer 4:
Say you have two columns:
col1 = ['pro1', 'pro2', ...]
col2 = ['lig1', 'lig2', ...]
Then the most straightforward way to do this would be to use itertools.product and random.sample as below:
from itertools import product
from random import sample
N = 100 #How many pairs to generate
randomPairs = sample(list(product(col1, col2)), N)
If col1 and col2 contain duplicate items, you can extract the unique items by doing set(col1) and set(col2).
Note that list(product(...)) builds a list of n * m elements, where n and m are the numbers of unique items in the two columns. This may cause memory problems if n * m ends up being a very large number.
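When the full product is too large to hold in memory, one alternative is rejection sampling: draw one item from each column at random and keep the pair only if it is new. The sketch below is an assumption-laden illustration rather than part of the original answer; `sample_new_pairs` and its parameters are hypothetical names, and it also skips pairs that already occur in the input, as the question asks:

```python
import random

def sample_new_pairs(col1, col2, n_pairs, rng=random):
    """Draw `n_pairs` distinct pairs that do not occur in zip(col1, col2)."""
    existing = set(zip(col1, col2))
    firsts = sorted(set(col1))   # unique items of the first column
    seconds = sorted(set(col2))  # unique items of the second column

    # Number of pairs that are still free; guards against an endless loop.
    available = len(firsts) * len(seconds) - len(existing)
    if n_pairs > available:
        raise ValueError("only %d new pairs are possible" % available)

    chosen = set()
    while len(chosen) < n_pairs:
        pair = (rng.choice(firsts), rng.choice(seconds))
        if pair not in existing:
            chosen.add(pair)  # a set ignores pairs that were drawn before
    return sorted(chosen)

col1 = ["pro1", "pro2", "pro3", "pro4"]
col2 = ["lig1", "lig2", "lig3", "lig1"]
random_pairs = sample_new_pairs(col1, col2, 5)
```

Memory use grows with the number of requested pairs instead of n * m. The loop slows down when `n_pairs` approaches the number of free pairs, because most draws are then rejected.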
Source: https://stackoverflow.com/questions/14826378/random-generation-of-unique-combination-from-two-column