How to calculate frequency of elements for pairwise comparisons of lists in Python?

Question

I have the sample stored in the following list:

 sample = ["AAAA", "CGCG", "TTTT", "AT-T", "CATC"]

To illustrate the problem, I have denoted them as "Sets" below:

Set1 AAAA
Set2 CGCG
Set3 TTTT
Set4 AT-T
Set5 CATC

Eliminate all Sets in which every element is identical to the others (here AAAA and TTTT).

Output:

 Set2 CGCG
 Set4 AT-T
 Set5 CATC
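
The elimination step above can be sketched with a plain for loop. This is only a minimal sketch, assuming the sample is a list of strings; `drop_uniform` is a hypothetical helper name, not something from the original post.

```python
def drop_uniform(sample):
    """Return the sequences that contain at least two different characters."""
    kept = []
    for seq in sample:
        # A set collapses repeated characters, so a uniform sequence has size 1.
        if len(set(seq)) > 1:
            kept.append(seq)
    return kept


sample = ["AAAA", "CGCG", "TTTT", "AT-T", "CATC"]
print(drop_uniform(sample))  # ['CGCG', 'AT-T', 'CATC']
```

Building a new list avoids deleting items from `sample` while iterating over it.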

Perform pairwise comparisons between the remaining Sets (Set2 v Set4, Set2 v Set5, Set4 v Set5).
Each pairwise comparison may contain only two types of pairs; otherwise that comparison is eliminated, e.g.:
```
Set2    Set5
C       C
G       A
C       T 
G       C
```

Here there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison is eliminated.

Every comparison can have only 2 combinations out of (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible combinations of ACGT where order matters.

In the given example, more than 2 such combinations are found.

Hence Set2 v Set5 and Set4 v Set5 cannot be considered. Thus the only pair that remains is:

Output
Set2 CGCG
Set4 AT-T
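
The two-pair-types rule could be checked roughly as follows. This sketch assumes that positions containing "-" are ignored when counting pair types, which is inferred from Set2 v Set4 surviving in the example rather than stated outright; `pair_types` is a hypothetical helper name.

```python
def pair_types(set_a, set_b):
    """Collect the distinct (a, b) position pairs, skipping positions with '-'."""
    types = set()
    for a, b in zip(set_a, set_b):
        # Assumption: a gap in either sequence means the position is not counted.
        if a == "-" or b == "-":
            continue
        types.add((a, b))
    return types


kept = ["CGCG", "AT-T", "CATC"]
for i in range(len(kept)):
    for j in range(i + 1, len(kept)):
        types = pair_types(kept[i], kept[j])
        if len(types) == 2:
            verdict = "keep"
        else:
            verdict = "eliminate"
        print(kept[i], "v", kept[j], sorted(types), verdict)
```

With the example data only CGCG v AT-T has exactly two pair types; the other two comparisons have four and three.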

In this pairwise comparison, remove any element that is "-" together with its corresponding element in the other Set:
```
Output    
Set2 CGG
Set4 ATT
```
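
A minimal sketch of the gap removal, assuming both sequences have the same length; `strip_gaps` is a hypothetical name used only for illustration.

```python
def strip_gaps(set_a, set_b):
    """Drop every position where either sequence has '-'."""
    out_a = ""
    out_b = ""
    for a, b in zip(set_a, set_b):
        if a != "-" and b != "-":
            out_a += a
            out_b += b
    return out_a, out_b


print(strip_gaps("CGCG", "AT-T"))  # ('CGG', 'ATT')
```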
Calculate the frequency of elements in Set2 and in Set4, and the frequency of each type of pair across the two Sets (CA and GT pairs):
```
Output
Set2 (C = 1/3, G = 2/3)
Set4 (A = 1/3, T = 2/3)
Pairs (CA = 1/3, GT = 2/3)
```
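
The frequencies above can be reproduced with a small counting loop. This is a sketch only; `frequencies` is a hypothetical helper, and `fractions.Fraction` is used here (an illustrative choice) so the results stay as exact thirds instead of rounded floats.

```python
from fractions import Fraction


def frequencies(items):
    """Map each distinct item to its share of the total, as an exact fraction."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    freqs = {}
    for item in counts:
        freqs[item] = Fraction(counts[item], len(items))
    return freqs


set2, set4 = "CGG", "ATT"

# Build the position-wise pairs: CA, GT, GT.
pairs = []
for a, b in zip(set2, set4):
    pairs.append(a + b)

print(frequencies(set2))
print(frequencies(set4))
print(frequencies(pairs))
```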
Calculate float(a) = (Pairs) - (Set2) * (Set4) for the corresponding elements (any one pair is sufficient):
```
e.g. for CA pairs, float(a) = (freq of CA pairs) - (freq of C) * (freq of A)
```

NOTE: If the pair is AAAC and CCCA, the frequency of C would be 1/4, i.e. it is the frequency of the base within one Set of the pair.

Calculate

float(b) = float(a) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))
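
As a worked check of the two formulas on the CGG / ATT example, here is a short sketch. It assumes the whole product of the four frequencies is the divisor in float(b), which is how the formula is read here; the variable names are illustrative.

```python
from fractions import Fraction

# Frequencies taken from the example: CGG and ATT, pairs CA and GT.
p_c, p_g = Fraction(1, 3), Fraction(2, 3)
p_a, p_t = Fraction(1, 3), Fraction(2, 3)
p_ca, p_gt = Fraction(1, 3), Fraction(2, 3)

# float(a) from the CA pair; the GT pair gives the same value.
float_a = p_ca - p_c * p_a
float_a_from_gt = p_gt - p_g * p_t

# Assumption: float(a) is divided by the product of all four frequencies.
float_b = float_a / (p_c * p_g * p_a * p_t)

print(float_a, float_a_from_gt, float_b)  # 2/9 2/9 9/2
```

Both pairs giving 2/9 in this example agrees with the remark that any one pair is sufficient.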

Repeat this for all pairwise comparisons

eg.

Set2 CGCG
Set4 AT-T
Set6 GCGC

Set2 v Set4, Set2 v Set6, Set4 v Set6
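
For listing each pairwise comparison exactly once, one option is `itertools.combinations` from the standard library. The dictionary of named sets below is just an illustrative way to hold the example data.

```python
from itertools import combinations

sets = {"Set2": "CGCG", "Set4": "AT-T", "Set6": "GCGC"}

# combinations(..., 2) yields every unordered pair of names once.
comparisons = list(combinations(sets, 2))
for name_a, name_b in comparisons:
    print(name_a, "v", name_b, ":", sets[name_a], sets[name_b])
```

The same enumeration can be written as two nested for loops where the inner index starts one past the outer index.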

My half-baked code so far (I would prefer that all suggested code use standard for-loops rather than comprehensions):

#Step 1: drop the sets whose elements are all identical
sample1 = []
for i in sample:
    all_same = True
    for j in range(len(i) - 1):
        if i[j] != i[j + 1]:
            all_same = False
    if not all_same:
        sample1.append(i)

#Step 2: visit every pair of remaining sets once
for a in range(len(sample1)):
    for b in range(a + 1, len(sample1)):
        set_a = sample1[a]
        set_b = sample1[b]
        #Step 3: TODO only two types of pairs can be included, if yes continue else skip
        #Step 4: delete "-" and the corresponding element in the other set
        kept_a = ""
        kept_b = ""
        for k in range(len(set_a)):
            if set_a[k] != "-" and set_b[k] != "-":
                kept_a += set_a[k]
                kept_b += set_b[k]
        #Step 5: frequencies of individual alleles in kept_a
        count_dict = {}
        for base in kept_a:
            if base in count_dict:
                count_dict[base] += 1
            else:
                count_dict[base] = 1
        freq_dict = {}
        for allele in count_dict:
            freq_dict[allele] = count_dict[allele] / float(len(kept_a))
        #TODO calculate frequency of pairs
        #Step 6: no code yet

Answer 1:

I think this is what you want:

from collections import Counter

# Remove elements where all nucleobases are the same.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        # Keeping indexA < indexB puts the earlier set first, as in the question (Set2 v Set4).
        if indexB <= indexA:
            continue

        # Calculate number of unique pairs
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

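        # Count individual bases, separately for each set.
        base_count_a = Counter()
        base_count_b = Counter()
        for pair, count in pair_count.items():
            base_count_a[pair[0]] += count
            base_count_b[pair[1]] += count

        # Number of positions left once pairs containing '-' are dropped.
        sequence_length = sum(pair_count.values())

        # Convert counts to frequencies within each set.
        base_freq_a = {}
        for base, count in base_count_a.items():
            base_freq_a[base] = count / float(sequence_length)
        base_freq_b = {}
        for base, count in base_count_b.items():
            base_freq_b[base] = count / float(sequence_length)

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq_a[pair[0]] * base_freq_b[pair[1]]

        # Step 7! Divide by the product of every base frequency in both sets.
        # Only bases that actually occur are multiplied, so the product is never zero.
        denominator = 1.0
        for freq in list(base_freq_a.values()) + list(base_freq_b.values()):
            denominator *= freq
        float_b = float_a / denominator
        print('%s v %s: float_a = %s, float_b = %s' % (setA, setB, float_a, float_b))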

Or, more Pythonically (with the list/dict comprehensions you don't want):

from collections import Counter

# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexB <= indexA:
            continue

        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

        # setA and setB as tuples with pairs involving '-' removed.
        # New names are used so the loop variable setA is not overwritten.
        basesA, basesB = zip(*relevant_pairs)

        # Number of positions left after removing '-'.
        seq_length = len(basesA)

        # Convert counts to frequencies, separately for each set.
        freqA = {base: count / float(seq_length) for (base, count) in Counter(basesA).items()}
        freqB = {base: count / float(seq_length) for (base, count) in Counter(basesB).items()}

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - freqA[pair[0]] * freqB[pair[1]]

        # Step 7! Only bases that actually occur are multiplied, so the product is never zero.
        denominator = 1.0
        for freq in list(freqA.values()) + list(freqB.values()):
            denominator *= freq

        float_b = float_a / denominator
        print('%s v %s: float_a = %s, float_b = %s' % (setA, setB, float_a, float_b))

Source: https://stackoverflow.com/questions/40072098/how-to-create-dictionaries-when-comparing-two-elements-at-a-time-in-python

Tags

python

list

for-loop

dictionary

frequency