Question
I have two data frames / lists of data, `humanSplit` and `ratSplit`, and they are of the form
```
> ratSplit$Kidney_F_GSM1328570
  ratGene        ratReplicate alignment RNAtype
1    Crot Kidney_F_GSM1328570         7     REV
2    Crot Kidney_F_GSM1328570        12     REV
3    Crot Kidney_F_GSM1328570         4     REV
```
and
```
> humanSplit$Fetal_Brain_408_AGTCAA_L009_R1_report.txt
   humanGene                            humanReplicate alignment RNAtype
53     ZFP28 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         5     reg
55     RC3H1 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         9     reg
56     IFI27 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         4     reg
```
And another file used below is `geneList`, of the form:
```
ABAT,Abat
ABCA1,Abca1
ABCA12,Abca12
ABCA2,Abca2
ABCA3,Abca17
ABCA4,Abca4
ABCA5,Abca5
```
Now I want to run Fisher's exact test on every pairwise combination of elements from `ratSplit` and `humanSplit`, after some data manipulation, and ultimately write the test results to a CSV file. Right now I am using a double for loop, but I am wondering how I can use `sapply` or a related function to make it more efficient.
Currently I am doing the following. First I make a data frame `result`, where I store/append the information obtained from Fisher's test for each pair at each step. Finally, when the whole loop is done, I write `result` to a CSV file. My understanding is that to use `sapply` I need to turn the inside of the loop into a function and then call `sapply` on it, but I am not sure of the best way to optimize this. Any help would be much appreciated.
```r
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate",
                     pvalue = "p-value", alternative = "alternative_hypothesis",
                     Conf.int1 = "conf.int1", Conf.int2 = "conf.int2", oddratio = "Odd_Ratio")

for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])

    # merge the two based on the one-to-one gene mapping in geneList, defined above
    mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData   <- merge(geneList, ratSplit[[i]],   by.x = "rat",   by.y = "ratGene")

    mergedHumanData <- mergedHumanData[, c(1, 2, 4, 5)]  # rearrange columns
    mergedRatData   <- mergedRatData[, c(2, 1, 4, 5)]    # rearrange columns
    mergedHumanRatData <- rbind(mergedHumanData, mergedRatData)
    # now the columns are "human", "rat", "alignment", "RNAtype"

    agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData, FUN = getGeneType)  # agg to make HmRn form
    HmRnTable <- table(agg$RNAtype)  # table of HmRn, i.e. RNAtype in human and rat

    # now assign these counts to the variables HyRy etc. Some form of HmRn may be
    # absent from the table; that is why the is.integer0 function is used
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])

    contingencyTable <- matrix(c(HnRn, HnRy, HyRn, HyRy), nrow = 2)
    fisherTest <- fisher.test(contingencyTable)

    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName,
                              pvalue = fisherTest$p.value, alternative = fisherTest$alternative,
                              Conf.int1 = fisherTest$conf.int[1], Conf.int2 = fisherTest$conf.int[2],
                              oddratio = fisherTest$estimate[[1]])))
    result <- rbind(result, newLine)
  }
}

write.table(result, file = "newData5.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
```
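(For reference: `is.integer0` is not a base R function; a typical definition, assumed here since the question does not show it, is:)

```r
## Assumed definition; not shown in the question.
is.integer0 <- function(x) is.integer(x) && length(x) == 0L
```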
Answer 1:
It's difficult to test this since we're missing `geneList`, but I'm inferring that the code works correctly, so you merely want to speed things up. Here are some pointers to help:
Don't predefine `result` like this. By setting the first entry of each column to be a string of the column name, you ensure that all subsequent entries will be coerced into strings. Though this doesn't necessarily bite you since you're eventually writing to a CSV anyway, it's bad form. It will bite you if you intend to read it back into R, since the first row will be used as the column names, but the second row will be all strings, forcing subsequent data to be strings as well. (You'll then have to clean your own data. Wasteful.)

At the end of each iteration, you call `rbind`. This might do alright for a while, but repeated calls to `rbind` will result in the whole data.frame being copied each time. With more rows being appended, this will cause a not-insignificant slowdown in your code. This can be fixed by one of two methods, listed below.
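(As a minimal sketch of the pre-allocate-and-bind-once idiom, my own illustration rather than code from the answer, assuming `ratSplit` and `humanSplit` as in the question:)

```r
## Collect one row per (i, j) pair in a pre-allocated list, bind once at the end.
rows <- vector("list", length(ratSplit) * length(humanSplit))
k <- 0
for (i in seq_along(ratSplit)) {
  for (j in seq_along(humanSplit)) {
    k <- k + 1
    rows[[k]] <- data.frame(i = i, j = j)  # stand-in for the per-pair Fisher results
  }
}
result <- do.call(rbind, rows)  # a single rbind: no repeated copying
```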
Since you use each of `names(HmRnTable) == "HyRy"` twice, my technique is to first save that to a vector (or a scalar if you use `which(...)`), and then use this variable when subsetting `HmRnTable`. It will likely speed things up a little and may also make the code easier to read. In fact, you can shorten each of those four assignments to something like (untested):

```r
idx <- which(names(HmRnTable) == 'HyRy')
HyRy <- if (length(idx) == 0) 0 else HmRnTable[[idx]]
## repeat for HyRn, HnRy, and HnRn
```
I strongly urge you to take the majority of the code and put it in a function that takes either two arguments (`i` and `j`) or four arguments (`list1`, `index1`, `list2`, `index2`). (Which you choose depends on how much OCD you have with regard to variable scope references.) I'll assume that the function literally returns the results from `fisher.test` with no massaging. This will make for much easier testing while writing the function, and for its inclusion later on in this script. I'll reference the function as `myfisher(i, j)` below.
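(For concreteness, here is a sketch of what such a function might look like; it is simply the question's loop body with the bookkeeping stripped out, and it assumes the question's `geneList`, `getGeneType`, `ratSplit`, and `humanSplit` are in scope:)

```r
## myfisher(i, j): merge, aggregate, and test one rat/human replicate pair.
myfisher <- function(i, j) {
  mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
  mergedRatData   <- merge(geneList, ratSplit[[i]],   by.x = "rat",   by.y = "ratGene")
  merged <- rbind(mergedHumanData[, c(1, 2, 4, 5)],   # columns: human, rat, alignment, RNAtype
                  mergedRatData[,   c(2, 1, 4, 5)])
  agg    <- aggregate(RNAtype ~ human + rat, data = merged, FUN = getGeneType)
  counts <- table(agg$RNAtype)
  cnt <- function(lbl) if (lbl %in% names(counts)) counts[[lbl]] else 0  # 0 when a combination is absent
  fisher.test(matrix(c(cnt("HnRn"), cnt("HnRy"), cnt("HyRn"), cnt("HyRy")), nrow = 2))
}
```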
I'm inferring that you have a high number of comparisons to run (since each iteration really shouldn't take that long). @Floo0's comment about `outer` can work, as can `expand.grid`. Either way, you are comparing each element of `list1` with each element of `list2`. If you start with:

```r
(eg <- expand.grid(i = 1:length(ratSplit), j = 1:length(humanSplit)))
##   i j
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## ...
```
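(As an aside, and my own sketch rather than the answer's: this grid is also all you need for the plain base-R route the question asked about, e.g. with `mapply`:)

```r
## One myfisher() call per row of eg, kept as a list of htest objects.
tests <- mapply(myfisher, eg$i, eg$j, SIMPLIFY = FALSE)
pvals <- vapply(tests, function(ft) ft$p.value, numeric(1))
```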
This gives us an easy data.frame on which we can use `apply`. Honestly, though, I like the elegance of `ddply` in this case, since it goes from a data.frame to a data.frame easily:

```r
library(plyr)
ret <- ddply(eg, .(i, j), function(df) {
  with(myfisher(df$i, df$j),
       data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                  alt = alternative))
})
```
Note that I expressly did not include `humanReplicateName` or `ratReplicateName`, since those can be added before the `ddply` (or after, referencing `ret$` instead of `eg$`) with:

```r
eg$ratReplicateName   <- names(ratSplit)[ eg$i ]
eg$humanReplicateName <- names(humanSplit)[ eg$j ]
```
and the names will magically be in the output, too (include them in the grouping, e.g. `.(i, j, ratReplicateName, humanReplicateName)`, and `ddply` carries them through). Less to deal with inside the loop.
Up until now, this will result in one data.frame that you can then save to a CSV.
I'll make one more recommendation that may be overkill, depending on how long this will be running. I've found at times that my long-running scripts are interrupted, perhaps because I had to adjust something on the fly, I found a bug, or the computer had to reboot. There is no easy way to allow mid-stream continuation, but I've worked out some techniques to mitigate that.
Instead of `ddply`, use `d_ply`, which does not assume the type of the return value. Instead of returning a (single-row) data.frame, immediately save that one row (with or without headers, your call) to a file. Though the code below is not truly "atomic" and race conditions may therefore occur, it's robust enough to meet most needs:

```r
library(plyr)
savedir <- './output'
d_ply(eg, .(i, j), function(df) {
  fn <- file.path(savedir, sprintf('%s-%s.csv', df$i, df$j))
  if (! file.exists(fn)) {
    ret <- with(myfisher(df$i, df$j),
                data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                           alt = alternative))
    write.table(ret, file = fn, sep = ",", append = FALSE,
                row.names = FALSE, col.names = TRUE)
  }
})
```
One nice property of this is that if/when it is interrupted, the most you will have to do is look at the files within "./output/", delete any with 0 bytes, and re-run: it will only execute for the missing files. Oh, and it gets better.
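(A sketch of that gather-and-recombine step, my addition rather than the answer's, assuming the filenames written above:)

```r
## Stitch the per-pair CSVs back into a single results file.
files  <- list.files('./output', pattern = '\\.csv$', full.names = TRUE)
files  <- files[file.size(files) > 0]  # drop empty files from interrupted runs
result <- do.call(rbind, lapply(files, read.csv))
write.csv(result, 'newData5.csv', row.names = FALSE)
```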
Parallelizing it. If you need to go this far (and some functions don't see as much improvement as others), you can instead use multiple cores on your system. You could do something like this:
```r
library(parallel)
cl <- makeCluster(detectCores() - 1)  # I like to keep one core free
clusterEvalQ(cl, {
  load('myTwoLists.rda')  # with the ratSplit and humanSplit variables
  source('mycode.R')      # with myfisher(i, j)
  ## any libraries you may want/need to add, if you get more advanced
})
eg <- expand.grid(i = 1:length(ratSplit), j = 1:length(humanSplit))
eg$ratReplicateName   <- names(ratSplit)[ eg$i ]
eg$humanReplicateName <- names(humanSplit)[ eg$j ]
clusterExport(cl, 'savedir')  # the workers need savedir ('./output') from above
ign <- parApply(cl, eg, 1, function(r) {
  i <- as.integer(r[1]) ; j <- as.integer(r[2])  # apply() hands the row over as characters
  fn <- file.path(savedir, sprintf('%s-%s.csv', i, j))
  if (! file.exists(fn)) {
    ret <- with(myfisher(i, j),
                data.frame(ratName = r[3], humanName = r[4],
                           pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                           alt = alternative))
    write.table(ret, file = fn, sep = ",", append = FALSE,
                row.names = FALSE, col.names = TRUE)
  }
})
stopCluster(cl)
```
Notice that we aren't referencing the row as a data.frame anymore. I'd love it if Hadley's code worked natively with `parallel`. (For all I know, it does and I just haven't found it yet!) Edit: none of my past projects using `parallel` were using `ddply`, and vice versa, so I have never played with `ddply(..., .parallel = TRUE)`, which uses `foreach`.
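(For completeness, an untested sketch of that `.parallel = TRUE` route, assuming a `doParallel` backend; this is my addition, not something from the answer:)

```r
library(plyr)
library(doParallel)
registerDoParallel(cores = detectCores() - 1)  # register a foreach backend
ret <- ddply(eg, .(i, j), .parallel = TRUE, function(df) {
  with(myfisher(df$i, df$j),
       data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                  alt = alternative))
})
```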
Caveat Emptor: this code has not been tested in this context, though it is almost copy/pasted from working code. Because of edits, I may be off-by-one or missing a paren. Hope it helps!
Source: https://stackoverflow.com/questions/25223808/applying-sapply-or-other-apply-function-instead-of-nested-for-loop-for-lists-of