Question
I have (or will have) data that looks like the following:
Individual Nuk Name Position Individual.1 Nuk.1 Name.1 Position.1
Ind 1 A Locus_1988 23 Ind 1 A Locus_3333 15
Ind 2 A Locus_1988 23 Ind 2 G Locus_3333 15
Ind 3 G Locus_1988 23 Ind 3 A Locus_3333 15
Ind 4 G Locus_1988 23 Ind 4 - Locus_3333 15
Ind 5 A Locus_1988 23 Ind 5 G Locus_3333 15
Ind 6 G Locus_1988 23 Ind 6 G Locus_3333 15
Ind 1 C Locus_1988 23 Ind 1 C Locus_3333 18
Ind 2 T Locus_1988 23 Ind 2 C Locus_3333 18
Ind 3 T Locus_1988 23 Ind 3 T Locus_3333 18
Ind 4 C Locus_1988 23 Ind 4 - Locus_3333 18
Ind 5 - Locus_1988 23 Ind 5 C Locus_3333 18
Ind 6 T Locus_1988 23 Ind 6 T Locus_3333 18
Ind 1 T Locus_2301 12 Ind 1 T Locus_4123 38
Ind 2 T Locus_2301 12 Ind 2 T Locus_4123 38
Ind 3 A Locus_2301 12 Ind 3 - Locus_4123 38
Ind 4 - Locus_2301 12 Ind 4 A Locus_4123 38
Ind 5 A Locus_2301 12 Ind 5 A Locus_4123 38
Ind 6 T Locus_2301 12 Ind 6 T Locus_4123 38
Ind 1 G Locus_2301 31 Ind 1 G Locus_4123 52
Ind 2 C Locus_2301 31 Ind 2 C Locus_4123 52
Ind 3 C Locus_2301 31 Ind 3 G Locus_4123 52
Ind 4 G Locus_2301 31 Ind 4 C Locus_4123 52
Ind 5 - Locus_2301 31 Ind 5 C Locus_4123 52
Ind 6 G Locus_2301 31 Ind 6 - Locus_4123 52
The data is organised as pairs of loci (in the above, for example, Locus_1988 and Locus_3333 form a pair). For each combination of positions within a pair, I need to run a Four-Gamete Test (FGT) on the Nuk columns, i.e. check whether all four two-letter combinations of any two of the four possible letters G, C, A, T are present.
So for the data above, for the pair Locus_1988 Position 23 + Locus_3333 Position 15, the combinations present are AA AG GA G- AG GG. As the combinations AA, AG, GA and GG are all present, this pair passes the FGT, and this needs to be registered (i.e. with a 1 in a new_column).
The next group in the above data, Locus_1988 Position 23 + Locus_3333 Position 18, has the following combinations: CC TC TT C- -C TT. As the combination CT is missing, this group does not pass the FGT (registered as 0 in the new_column).
How would you proceed to do this test?
There are many loci, each with many (30) individuals, and some (but not all) loci have several positions to be tested.
I think it should be possible to build the test along these lines:
if(grepl("AG" & "GA" & "AA" & "GG" | "AC" & "CA" & "AA" & "CC" | "AT" & "TA" & "AA" & "TT" | "CT" & "TC" & "CC" & "TT" | "CG" & "GC" & "CC" & "GG" | "GT" & "TG" & "GG" & "TT", data="combination of the two columns")) print("1") else print("0")
But I am apparently not allowed to use the & and | operators this way. I'm also having a lot of trouble figuring out how to run the test grouped first by Name and then by Position. Would you give each group a unique name in a new column (as below), and run the test on each group?
Individual Nuk Name Pos Individual.1 Nuk.1 Name.1 Pos.1 Grp
Ind 1 A Locus_1988 23 Ind 1 A Locus_3333 15 1
Ind 2 A Locus_1988 23 Ind 2 G Locus_3333 15 1
Ind 3 G Locus_1988 23 Ind 3 A Locus_3333 15 1
Ind 4 G Locus_1988 23 Ind 4 - Locus_3333 15 1
Ind 5 A Locus_1988 23 Ind 5 G Locus_3333 15 1
Ind 6 G Locus_1988 23 Ind 6 G Locus_3333 15 1
Ind 1 C Locus_1988 23 Ind 1 C Locus_3333 18 2
Ind 2 T Locus_1988 23 Ind 2 C Locus_3333 18 2
Ind 3 T Locus_1988 23 Ind 3 T Locus_3333 18 2
Ind 4 C Locus_1988 23 Ind 4 - Locus_3333 18 2
Ind 5 - Locus_1988 23 Ind 5 C Locus_3333 18 2
Ind 6 T Locus_1988 23 Ind 6 T Locus_3333 18 2
Ind 1 T Locus_2301 12 Ind 1 T Locus_4123 38 3
Ind 2 T Locus_2301 12 Ind 2 T Locus_4123 38 3
Ind 3 A Locus_2301 12 Ind 3 - Locus_4123 38 3
Ind 4 - Locus_2301 12 Ind 4 A Locus_4123 38 3
Ind 5 A Locus_2301 12 Ind 5 A Locus_4123 38 3
Ind 6 T Locus_2301 12 Ind 6 T Locus_4123 38 3
Ind 1 G Locus_2301 31 Ind 1 G Locus_4123 52 4
Ind 2 C Locus_2301 31 Ind 2 C Locus_4123 52 4
Ind 3 C Locus_2301 31 Ind 3 G Locus_4123 52 4
Ind 4 G Locus_2301 31 Ind 4 C Locus_4123 52 4
Ind 5 - Locus_2301 31 Ind 5 C Locus_4123 52 4
Ind 6 G Locus_2301 31 Ind 6 - Locus_4123 52 4
I'm thinking this could be done in a loop, but I'm afraid this might take a long time to process, as I have a lot of data.
Answer 1:
Split the data (df1) by locus names and positions:
split1 <- split(df1, list(df1$Name, df1$Position, df1$Name.1, df1$Position.1), drop = TRUE)
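To try the split on the sample from the question, `df1` can be rebuilt by hand. This is only a minimal sketch: it assumes character columns and the column names shown in the question's first table, with the values copied from that table.

```r
# Sample data from the question: 6 individuals x 4 position groups
ind <- rep(paste("Ind", 1:6), times = 4)
df1 <- data.frame(
  Individual   = ind,
  Nuk          = c("A","A","G","G","A","G",  "C","T","T","C","-","T",
                   "T","T","A","-","A","T",  "G","C","C","G","-","G"),
  Name         = rep(c("Locus_1988", "Locus_2301"), each = 12),
  Position     = rep(c(23, 23, 12, 31), each = 6),
  Individual.1 = ind,
  Nuk.1        = c("A","G","A","-","G","G",  "C","C","T","-","C","T",
                   "T","T","-","A","A","T",  "G","C","G","C","C","-"),
  Name.1       = rep(c("Locus_3333", "Locus_4123"), each = 12),
  Position.1   = rep(c(15, 18, 38, 52), each = 6),
  stringsAsFactors = FALSE
)

# Same split as above; drop = TRUE discards name/position combinations
# that never occur together in the data
split1 <- split(df1, list(df1$Name, df1$Position, df1$Name.1, df1$Position.1),
                drop = TRUE)

# Four groups of six rows each
sapply(split1, nrow)
```

Each element of `split1` is a data frame holding the rows of one locus-pair/position group, so no separate Grp column is needed.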
Run the test on each group and bind the results:
do.call(rbind,
  lapply(split1, function(x) {
    # Letters observed at either position, ignoring gaps
    all_letters <- union(x$Nuk, x$Nuk.1)
    all_letters <- all_letters[all_letters != "-"]
    # Every ordered pairing of those letters
    letter_comb <- expand.grid(all_letters, all_letters, stringsAsFactors = FALSE)
    data.frame(
      # TRUE only if each pairing occurs in at least one individual
      FGT = all(
        sapply(seq_len(nrow(letter_comb)), function(i) {
          any(x$Nuk == letter_comb[i, 1] & x$Nuk.1 == letter_comb[i, 2])
        })
      ),
      Name = x$Name[1], Position = x$Position[1],
      Name.1 = x$Name.1[1], Position.1 = x$Position.1[1]
    )
  })
)
Result:
# FGT Name Position Name.1 Position.1
# Locus_1988.23.Locus_3333.15 TRUE Locus_1988 23 Locus_3333 15
# Locus_1988.23.Locus_3333.18 FALSE Locus_1988 23 Locus_3333 18
# Locus_2301.12.Locus_4123.38 FALSE Locus_2301 12 Locus_4123 38
# Locus_2301.31.Locus_4123.52 TRUE Locus_2301 31 Locus_4123 52
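The question asks for a 1/0 flag in a new_column on every row rather than a per-group summary. The following is a minimal sketch of one way to get there, not the answerer's code. The helper `fgt_pass` is a hypothetical name introduced here; it applies the same all-pairings check as above, but written with `paste0()` and `%in%`, which is close to the "combination of the two columns" idea in the question. Only the first two groups of the sample data are reproduced to keep it short.

```r
# First two groups from the question's sample data
df1 <- data.frame(
  Nuk        = c("A","A","G","G","A","G",  "C","T","T","C","-","T"),
  Name       = "Locus_1988",
  Position   = 23,
  Nuk.1      = c("A","G","A","-","G","G",  "C","C","T","-","C","T"),
  Name.1     = "Locus_3333",
  Position.1 = rep(c(15, 18), each = 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every ordered pairing of the observed
# letters (gaps "-" excluded) occurs in at least one individual
fgt_pass <- function(nuk, nuk1) {
  letters_seen <- setdiff(union(nuk, nuk1), "-")
  needed   <- as.vector(outer(letters_seen, letters_seen, paste0))
  observed <- paste0(nuk, nuk1)
  all(needed %in% observed)
}

# One group id per unique (Name, Position, Name.1, Position.1) combination
grp <- interaction(df1$Name, df1$Position, df1$Name.1, df1$Position.1,
                   drop = TRUE)

# Evaluate each group once, then copy its flag to every row of that group
pass_by_grp <- sapply(split(seq_len(nrow(df1)), grp),
                      function(i) fgt_pass(df1$Nuk[i], df1$Nuk.1[i]))
df1$new_column <- as.integer(pass_by_grp[as.character(grp)])

# new_column is 1 for the Position.1 == 15 rows and 0 for the
# Position.1 == 18 rows
df1
```

Because the test is evaluated once per group and then looked up by group id, the cost grows with the number of groups rather than the number of rows.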
Source: https://stackoverflow.com/questions/28988469/four-gamete-test-in-r