Four-Gamete-Test in R

Question

I have (or will have) data that looks like the following:

Individual Nuk Name       Position Individual.1 Nuk.1 Name.1     Position.1
Ind 1      A   Locus_1988 23       Ind 1        A     Locus_3333 15
Ind 2      A   Locus_1988 23       Ind 2        G     Locus_3333 15
Ind 3      G   Locus_1988 23       Ind 3        A     Locus_3333 15
Ind 4      G   Locus_1988 23       Ind 4        -     Locus_3333 15
Ind 5      A   Locus_1988 23       Ind 5        G     Locus_3333 15
Ind 6      G   Locus_1988 23       Ind 6        G     Locus_3333 15
Ind 1      C   Locus_1988 23       Ind 1        C     Locus_3333 18
Ind 2      T   Locus_1988 23       Ind 2        C     Locus_3333 18
Ind 3      T   Locus_1988 23       Ind 3        T     Locus_3333 18
Ind 4      C   Locus_1988 23       Ind 4        -     Locus_3333 18
Ind 5      -   Locus_1988 23       Ind 5        C     Locus_3333 18
Ind 6      T   Locus_1988 23       Ind 6        T     Locus_3333 18
Ind 1      T   Locus_2301 12       Ind 1        T     Locus_4123 38
Ind 2      T   Locus_2301 12       Ind 2        T     Locus_4123 38
Ind 3      A   Locus_2301 12       Ind 3        -     Locus_4123 38
Ind 4      -   Locus_2301 12       Ind 4        A     Locus_4123 38
Ind 5      A   Locus_2301 12       Ind 5        A     Locus_4123 38
Ind 6      T   Locus_2301 12       Ind 6        T     Locus_4123 38
Ind 1      G   Locus_2301 31       Ind 1        G     Locus_4123 52
Ind 2      C   Locus_2301 31       Ind 2        C     Locus_4123 52
Ind 3      C   Locus_2301 31       Ind 3        G     Locus_4123 52
Ind 4      G   Locus_2301 31       Ind 4        C     Locus_4123 52
Ind 5      -   Locus_2301 31       Ind 5        C     Locus_4123 52
Ind 6      G   Locus_2301 31       Ind 6        -     Locus_4123 52

The data are organized as pairs of loci (above, for example, Locus_1988 and Locus_3333 form a pair). For each combination of positions within a pair, I need to run a Four-Gamete Test (FGT) on the Nuk columns, i.e. check whether all four two-letter combinations of the two letters observed (out of the four possible letters G, C, A, T) are present. For the data above, the pair Locus_1988 Position 23 + Locus_3333 Position 15 has the combinations AA AG GA G- AG GG. Since AA, AG, GA and GG are all present, this pair passes the FGT, and this needs to be registered (i.e. with a 1 in a new_column). The next group, Locus_1988 Position 23 + Locus_3333 Position 18, has the combinations CC TC TT C- -C TT. Since the combination CT is missing, this group does not pass the FGT (registered as 0 in the new_column).

How would you proceed to do this test?

There are many loci, with many (30) individuals in each, and several positions within some, but not all loci, to be tested.

I am thinking, that it should be possible to build the test along the lines of this:

But I am apparently not allowed to use the & and | operators. I'm also having a lot of trouble figuring out how to run this grouped first by Name and then by Position. Would you give each group a unique name in a new column (as below), and run the test on each group?

Individual Nuk Name       Pos      Individual.1 Nuk.1 Name.1     Pos.1 Grp
Ind 1      A   Locus_1988 23       Ind 1        A     Locus_3333 15    1
Ind 2      A   Locus_1988 23       Ind 2        G     Locus_3333 15    1
Ind 3      G   Locus_1988 23       Ind 3        A     Locus_3333 15    1
Ind 4      G   Locus_1988 23       Ind 4        -     Locus_3333 15    1
Ind 5      A   Locus_1988 23       Ind 5        G     Locus_3333 15    1
Ind 6      G   Locus_1988 23       Ind 6        G     Locus_3333 15    1
Ind 1      C   Locus_1988 23       Ind 1        C     Locus_3333 18    2
Ind 2      T   Locus_1988 23       Ind 2        C     Locus_3333 18    2
Ind 3      T   Locus_1988 23       Ind 3        T     Locus_3333 18    2
Ind 4      C   Locus_1988 23       Ind 4        -     Locus_3333 18    2
Ind 5      -   Locus_1988 23       Ind 5        C     Locus_3333 18    2
Ind 6      T   Locus_1988 23       Ind 6        T     Locus_3333 18    2
Ind 1      T   Locus_2301 12       Ind 1        T     Locus_4123 38    3
Ind 2      T   Locus_2301 12       Ind 2        T     Locus_4123 38    3
Ind 3      A   Locus_2301 12       Ind 3        -     Locus_4123 38    3
Ind 4      -   Locus_2301 12       Ind 4        A     Locus_4123 38    3
Ind 5      A   Locus_2301 12       Ind 5        A     Locus_4123 38    3
Ind 6      T   Locus_2301 12       Ind 6        T     Locus_4123 38    3
Ind 1      G   Locus_2301 31       Ind 1        G     Locus_4123 52    4
Ind 2      C   Locus_2301 31       Ind 2        C     Locus_4123 52    4
Ind 3      C   Locus_2301 31       Ind 3        G     Locus_4123 52    4
Ind 4      G   Locus_2301 31       Ind 4        C     Locus_4123 52    4
Ind 5      -   Locus_2301 31       Ind 5        C     Locus_4123 52    4
Ind 6      G   Locus_2301 31       Ind 6        -     Locus_4123 52    4

I'm thinking this could be done in a loop, but I'm afraid this might take a long time to process, as I have a lot of data.

Answer 1:

Split the data (df1) by positions and locus names:

split1 <- split(df1, list(df1$Name, df1$Position, df1$Name.1, df1$Position.1), drop = TRUE)
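To try the split on something concrete, here is a minimal sketch that rebuilds the example data from the question as df1. The construction is an assumption for illustration only (the original post does not show how df1 was read in), and the redundant Individual.1 column is left out.

```r
# Rebuild the 24 example rows from the question (4 groups x 6 individuals).
# Assumption: all columns are plain character/numeric vectors, "-" = missing.
df1 <- data.frame(
  Individual = rep(paste("Ind", 1:6), 4),
  Nuk        = c("A","A","G","G","A","G",  "C","T","T","C","-","T",
                 "T","T","A","-","A","T",  "G","C","C","G","-","G"),
  Name       = rep(c("Locus_1988", "Locus_2301"), each = 12),
  Position   = rep(c(23, 23, 12, 31), each = 6),
  Nuk.1      = c("A","G","A","-","G","G",  "C","C","T","-","C","T",
                 "T","T","-","A","A","T",  "G","C","G","C","C","-"),
  Name.1     = rep(c("Locus_3333", "Locus_4123"), each = 12),
  Position.1 = rep(c(15, 18, 38, 52), each = 6),
  stringsAsFactors = FALSE
)

# Same split as above: one data frame per locus pair and position pair.
split1 <- split(df1,
                list(df1$Name, df1$Position, df1$Name.1, df1$Position.1),
                drop = TRUE)

# Group labels are the four grouping values joined by "."
print(names(split1))
```

With drop = TRUE, only the combinations that actually occur in the data are kept, so the example yields four groups of six rows each.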

Create tests:

do.call(rbind, 
  lapply(split1, function(x) {
    # Letters observed at either site, ignoring the missing-data symbol "-"
    all_letters <- union( x$Nuk, x$Nuk.1 )
    all_letters <- all_letters[all_letters != "-"]
    # Every ordered two-letter combination of those letters
    letter_comb <- expand.grid(all_letters, all_letters, stringsAsFactors = FALSE)
    data.frame( 
      # TRUE only if each combination occurs in at least one individual
      FGT = all(
        sapply( seq_len(nrow(letter_comb)), function(i) {
          any(x$Nuk == letter_comb[i,1] & x$Nuk.1 == letter_comb[i,2])
        })
      ),
      Name = x$Name[1], Position = x$Position[1], 
      Name.1 = x$Name.1[1], Position.1 = x$Position.1[1] 
    )  
  })
)

Result:

#                               FGT       Name Position     Name.1 Position.1
# Locus_1988.23.Locus_3333.15  TRUE Locus_1988       23 Locus_3333         15
# Locus_1988.23.Locus_3333.18 FALSE Locus_1988       23 Locus_3333         18
# Locus_2301.12.Locus_4123.38 FALSE Locus_2301       12 Locus_4123         38
# Locus_2301.31.Locus_4123.52  TRUE Locus_2301       31 Locus_4123         52
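The code above pools the letters of both sites with union(), which works for the example because both sites in each group share the same two letters. If the two sites carried different letters (say A/G at one and C/T at the other), the pooled set would demand 16 combinations and the test could never be TRUE. The following is a hedged sketch of a variant that treats each site separately and returns 1/0 as asked in the question. The helper name fgt_pair is hypothetical, and the sketch assumes biallelic sites, "-" as the only missing-data symbol, and that individuals missing at either site are dropped before counting.

```r
# Hypothetical helper (not part of the original answer): four-gamete test
# for one pair of sites. Assumes biallelic sites and "-" as missing data.
fgt_pair <- function(a, b, missing = "-") {
  # Keep only individuals typed at both sites
  keep <- a != missing & b != missing
  a <- a[keep]
  b <- b[keep]
  # Pass only if each site shows two alleles and all 2 x 2 haplotypes occur
  ok <- length(unique(a)) == 2 &&
    length(unique(b)) == 2 &&
    length(unique(paste(a, b))) == 4
  as.integer(ok)
}

# Groups 1 and 2 from the question
g1 <- fgt_pair(c("A","A","G","G","A","G"), c("A","G","A","-","G","G"))
g2 <- fgt_pair(c("C","T","T","C","-","T"), c("C","C","T","-","C","T"))

# Made-up data where the two sites use different letters
g3 <- fgt_pair(c("A","A","G","G"), c("C","T","C","T"))

print(c(g1, g2, g3))
```

With split1 from above, something like sapply(split1, function(x) fgt_pair(x$Nuk, x$Nuk.1)) should give one 0/1 value per group, which can then be matched back to the rows by group label.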

Source: https://stackoverflow.com/questions/28988469/four-gamete-test-in-r

Tags

compare

dna-sequence