Question
I have the samples stored in the following list:
sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']
To illustrate the problem, I have denoted them as "Sets" below:
Set1 AAAA
Set2 CGCG
Set3 TTTT
Set4 AT-T
Set5 CATC
- Eliminate all Sets in which every element is identical to the others (e.g. AAAA, TTTT).
Output:
Set2 CGCG
Set4 AT-T
Set5 CATC
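A minimal sketch of this filtering step, written with plain for loops. The function name drop_uniform is hypothetical (it does not appear in the original post), and the sketch assumes the samples are stored as strings:

```python
def drop_uniform(sample):
    """Return a new list without the sets whose elements are all identical."""
    kept = []
    for seq in sample:
        uniform = True
        for element in seq:
            if element != seq[0]:
                uniform = False
                break
        if not uniform:
            kept.append(seq)
    return kept


sample = ["AAAA", "CGCG", "TTTT", "AT-T", "CATC"]
print(drop_uniform(sample))  # ['CGCG', 'AT-T', 'CATC']
```

Building a new list avoids deleting items from the list that is being iterated over.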
Perform pairwise comparisons between the remaining Sets (Set2 v Set4, Set2 v Set5, Set4 v Set5).
Each pairwise comparison can contain only two types of pairs; otherwise that comparison is eliminated. E.g.,
Set2 Set5
C C
G A
C T
G C
Here there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison is excluded.
Every comparison can contain only 2 of the combinations (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible ordered combinations of A, C, G and T.
In the given example, more than 2 such combinations are found.
Hence, Set2 and Set5, and Set4 and Set5, cannot be considered. Thus the only pair that remains is:
Output
Set2 CGCG
Set4 AT-T
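The pair-type check could be sketched as follows. The helper name pair_types is hypothetical. The sketch assumes that columns containing "-" are ignored when counting pair types (the expected output implies this, since Set2 v Set4 survives) and that a comparison is kept only when exactly two pair types remain:

```python
def pair_types(seq_a, seq_b):
    """Collect the distinct (a, b) column pairs, skipping columns with a gap."""
    types = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        if (a, b) not in types:
            types.append((a, b))
    return types


remaining = ["CGCG", "AT-T", "CATC"]
valid = []
for i in range(len(remaining)):
    # Start j after i so each pair of sets is compared only once.
    for j in range(i + 1, len(remaining)):
        if len(pair_types(remaining[i], remaining[j])) == 2:
            valid.append((remaining[i], remaining[j]))
print(valid)  # [('CGCG', 'AT-T')]
```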
In this pairwise comparison, remove any element that is "-" together with its corresponding element in the other Set.
Output
Set2 CGG
Set4 ATT
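One way to sketch the gap removal with a for loop. The name strip_gaps is hypothetical, and the sketch assumes both sequences have the same length:

```python
def strip_gaps(seq_a, seq_b):
    """Drop every column in which either sequence has a '-'."""
    kept_a = []
    kept_b = []
    for a, b in zip(seq_a, seq_b):
        if a != "-" and b != "-":
            kept_a.append(a)
            kept_b.append(b)
    return "".join(kept_a), "".join(kept_b)


print(strip_gaps("CGCG", "AT-T"))  # ('CGG', 'ATT')
```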
Calculate the frequency of elements in Set2 and in Set4. Calculate the frequency of occurrence of each type of pair across the Sets (CA and GT pairs).
Output
Set2 (C = 1/3, G = 2/3)
Set4 (A = 1/3, T = 2/3)
Pairs (CA = 1/3, GT = 2/3)
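The frequency step could look roughly like this. The helper name frequencies is hypothetical, and fractions.Fraction is used here only so that the results match the 1/3 and 2/3 values above exactly; plain floats would work as well:

```python
from fractions import Fraction


def frequencies(items):
    """Map each distinct item to its share of the total number of items."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    freqs = {}
    for item in counts:
        freqs[item] = Fraction(counts[item], len(items))
    return freqs


set2, set4 = "CGG", "ATT"
pairs = []
for a, b in zip(set2, set4):
    pairs.append(a + b)

freq_set2 = frequencies(set2)    # C = 1/3, G = 2/3
freq_set4 = frequencies(set4)    # A = 1/3, T = 2/3
freq_pairs = frequencies(pairs)  # CA = 1/3, GT = 2/3
```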
Calculate float(a) = (Pairs) - (Set2) * (Set4) for corresponding elements (any one pair is sufficient),
e.g. for CA pairs, float(a) = (freq of CA pairs) - (freq of C) * (freq of A)
NOTE: If the pair is AAAC and CCCA, the freq of C in AAAC would be 1/4, i.e. it is the frequency of the base within one member of the pair.
Calculate
float(b) = float(a) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))
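A worked version of the two formulas for the Set2 v Set4 example. It assumes that float(a) is divided by the product of all four frequencies, and it hard-codes the frequencies from the example; the variable names are illustrative only:

```python
from fractions import Fraction

# Frequencies taken from the worked example (Set2 = CGG, Set4 = ATT).
freq_set2 = {"C": Fraction(1, 3), "G": Fraction(2, 3)}
freq_set4 = {"A": Fraction(1, 3), "T": Fraction(2, 3)}
freq_pairs = {"CA": Fraction(1, 3), "GT": Fraction(2, 3)}

pair = "CA"
float_a = freq_pairs[pair] - freq_set2[pair[0]] * freq_set4[pair[1]]

# Denominator: product of every base frequency in each of the two sets.
denominator = 1
for base in freq_set2:
    denominator *= freq_set2[base]
for base in freq_set4:
    denominator *= freq_set4[base]
float_b = float_a / denominator

print(float_a, float_b)  # 2/9 9/2
```

Using the GT pair instead gives the same float(a): 2/3 - (2/3) * (2/3) = 2/9.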
Repeat this for all pairwise comparisons
eg.
Set2 CGCG
Set4 AT-T
Set6 GCGC
Set2 v Set4, Set2 v Set6, Set4 v Set6
My half-baked code so far (I would prefer that any suggested code use standard for loops rather than comprehensions):
#Step 1
for i in sample:
    for j in range(i):
        if j = j+1 #This needs to be corrected to if all elements in i identical to each other i.e. if all "j's" are the same
            del i
#insert line of code where sample1 = new sample with deletions as above
#Step 2
for i,i+1 in enumerate(sample):
    #Step 3
    for j in range(i):
        for k in range (i+1):
            #insert line of code to say only two types of pairs can be included, if yes continue else skip
            #Step 4
            if j = "-" or k = "-":
                #Delete j/k and the corresponding element in the other pair
#Step 5
count_dict = {}
square_dict = {}
for base in list(i):
    if base in count_dict:
        count_dict[base] += 1
    else:
        count_dict[base] = 1
for allele in count_dict:
    freq = (count_dict[allele] / len(i)) #frequencies of individual alleles
    #Calculate frequency of pairs
#Step 6
# No code yet
Answer 1:
I think this is what you want:
from collections import Counter
# Remove elements where all nucleobases are the same.
# Iterate backwards so that deleting an item does not shift the indexes still to be visited.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]
for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue
        # Calculate number of unique pairs, ignoring any pair that contains '-'.
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1
        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue
        # Count individual bases (counts from both sets are pooled per base).
        base_counter = Counter()
        for pair, count in pair_count.items():
            base_counter[pair[0]] += count
            base_counter[pair[1]] += count
        # Get the length of one of each item in the pair.
        sequence_length = sum(pair_count.values())
        # Convert counts to frequencies.
        base_freq = {}
        for base, count in base_counter.items():
            base_freq[base] = count / float(sequence_length)
        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq[pair[0]] * base_freq[pair[1]]
        # Step 7!
        # Note: the product below is 0 (raising ZeroDivisionError) unless all four bases occur in the two sets.
        float_b = float_a / float(base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0))
Or, more Pythonically (with the list/dict comprehensions you don't want):
from collections import Counter
BASES = 'ATCG'
# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]
for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue
        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)
        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue
        # setA and setB as tuples with pairs involving '-' removed.
        # New names are used so the outer loop variable setA is not overwritten.
        seqA, seqB = zip(*relevant_pairs)
        # Length of the gap-free sequences.
        seq_length = len(seqA)
        # Convert counts to frequencies.
        base_freq = {base : count / float(seq_length) for (base, count) in (Counter(seqA) + Counter(seqB)).items()}
        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - base_freq[pair[0]] * base_freq[pair[1]]
        # Step 7!
        # Note: denominator is 0 (raising ZeroDivisionError) unless all four bases occur in the two sets.
        denominator = 1
        for base in BASES:
            denominator *= base_freq.get(base, 0)
        float_b = float_a / denominator
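As a variation, here is a minimal for-loop sketch that keeps the base frequencies separate for each set, following the NOTE in the question, and multiplies only the frequencies of bases that actually occur. This is an assumption about the intended formula rather than part of the original answer; the names frequencies and compare are hypothetical, and Fraction is used only to keep the results exact:

```python
from fractions import Fraction


def frequencies(items):
    """Map each distinct item to its share of the total number of items."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    freqs = {}
    for item in counts:
        freqs[item] = Fraction(counts[item], len(items))
    return freqs


def compare(seq_a, seq_b):
    """Return (a, b) for one pairwise comparison, or None if it is skipped."""
    pairs = []
    for x, y in zip(seq_a, seq_b):
        if x != "-" and y != "-":
            pairs.append((x, y))
    pair_freq = frequencies(pairs)
    if len(pair_freq) != 2:
        return None
    col_a = []
    col_b = []
    for x, y in pairs:
        col_a.append(x)
        col_b.append(y)
    freq_a = frequencies(col_a)  # frequencies within the first set only
    freq_b = frequencies(col_b)  # frequencies within the second set only
    first = pairs[0]
    a = pair_freq[first] - freq_a[first[0]] * freq_b[first[1]]
    denominator = 1
    for base in freq_a:
        denominator *= freq_a[base]
    for base in freq_b:
        denominator *= freq_b[base]
    return a, a / denominator


sample = ["CGCG", "AT-T", "GCGC"]
results = {}
for i in range(len(sample)):
    for j in range(i + 1, len(sample)):
        outcome = compare(sample[i], sample[j])
        if outcome is not None:
            results[(sample[i], sample[j])] = outcome
for key in results:
    print(key, results[key])
```

For the Set2, Set4, Set6 example from the question, all three comparisons have exactly two pair types. Because only frequencies of bases that are present enter the product, the denominator cannot be zero.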
Source: https://stackoverflow.com/questions/40072098/how-to-create-dictionaries-when-comparing-two-elements-at-a-time-in-python