Applying sapply or another apply function instead of a nested for loop for lists of data frames


Question


I have two lists of data frames, `humanSplit` and `ratSplit`, and they are of the form

> ratSplit$Kidney_F_GSM1328570
  ratGene        ratReplicate alignment RNAtype
1    Crot Kidney_F_GSM1328570         7     REV
2    Crot Kidney_F_GSM1328570        12     REV
3    Crot Kidney_F_GSM1328570         4     REV

and

> humanSplit$Fetal_Brain_408_AGTCAA_L009_R1_report.txt
   humanGene                            humanReplicate alignment RNAtype
53     ZFP28 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         5     reg
55     RC3H1 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         9     reg
56     IFI27 Fetal_Brain_408_AGTCAA_L009_R1_report.txt         4     reg

Another file used below is geneList, of the form:

ABAT,Abat
ABCA1,Abca1
ABCA12,Abca12
ABCA2,Abca2
ABCA3,Abca17
ABCA4,Abca4
ABCA5,Abca5

Now I want to run Fisher's exact test on every pairwise combination of elements from ratSplit and humanSplit, after some data manipulation, and ultimately write the test results to a CSV file. Right now I am using a double for loop, but I am wondering how I can use sapply or a related function to make it more efficient.

Currently I am doing the following: I first create a data.frame result, to which I append the information from Fisher's test for each pair at each step. When the whole loop is done, I write the final result to a CSV file. My understanding is that to use sapply I need to turn the inside of the loop into a function and then call sapply on it, but I am not sure of the best way to optimize this. Any help would be much appreciated.

result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis", 
                     Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])

    #merging above two based on the one-to-one gene mapping as in geneList defined above.
    mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")

    mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
    mergedRatData <- mergedRatData[,c(2,1,4,5)]  #rearrange column
    mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"

    agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData, FUN = getGeneType) #aggregate to the HmRn form (HyRy, HnRn, ...)
    HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.

    #now assign these counts to the variables HyRy, HnRn, HyRn, HnRy. Some categories may be
    #absent from the table, which is why the is.integer0 function is used
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])

    contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)

    fisherTest <- fisher.test(contingencyTable) 
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p.value,
                              alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2], 
                              oddratio = fisherTest$estimate[[1]])))


    result <-rbind(result,newLine)
  }
}

write.table(result, file = "newData5.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")

Answer 1:


It's difficult to test this since we're missing geneList, but I'm inferring that the code works correctly, so you merely want to speed things up. Here are some pointers to help:

  1. Don't predefine result like this. By setting the first entry of each column to a string of the column name, you ensure that all subsequent entries will be coerced into strings. This doesn't necessarily bite you here, since you're eventually writing to a CSV anyway, but it's bad form, and it will bite you if you ever read the file back into R: the first row will be used as the column names, but the second row will be all strings, forcing subsequent data to be strings as well. (You'll then have to clean your own data, which is wasteful.)
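
    For instance (a tiny demonstration with a hypothetical one-column frame):

    df <- data.frame(pvalue = "p-value", stringsAsFactors = FALSE)
    df <- rbind(df, data.frame(pvalue = 0.03))
    str(df)   # pvalue is character: "p-value", "0.03" -- not numeric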

  2. At the end of each inner iteration, you call rbind. This might do alright for a while, but repeated calls to rbind copy the entire data.frame each time; as more rows are appended, this causes a not-insignificant slowdown in your code. This can be fixed either by the ddply approach in point 5 below or by the interim pattern sketched right after this item.
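
    If you keep the explicit loops, a common interim fix is to collect the one-row frames in a pre-sized list and bind once at the end (a sketch, reusing newLine from your loop body):

    rows <- vector("list", length(ratSplit) * length(humanSplit))
    k <- 0L
    for (i in seq_along(ratSplit)) {
      for (j in seq_along(humanSplit)) {
        k <- k + 1L
        ## ... build newLine exactly as in the question ...
        rows[[k]] <- newLine
      }
    }
    result <- do.call(rbind, rows)   # a single rbind, no repeated copying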

  3. Since you use each of names(HmRnTable) == "HyRy" twice, my technique is to first save that to a vector (or scalar if you use which(...)), and then use this variable in the subsetting of HmRnTable. It will likely speed things up a little but may also make the code a little easier to read. In fact, you can shorten each of those four assignments to something like (untested):

    idx <- names(HmRnTable) == 'HyRy'
    HyRy <- if (any(idx)) HmRnTable[idx][[1]] else 0
    ## repeat for HyRn, HnRy, and HnRn
    
  4. I strongly urge you to take the majority of the code and put it in a function that takes either two arguments (i and j) or four (list1, index1, list2, index2). (Which you choose depends on how much OCD you have with regard to variable scoping.) I'll assume that the function literally returns the results from fisher.test with no massaging; this makes for much easier testing while writing the function, and for inclusion later in this script. I'll reference the function as myfisher(i, j) below; a possible skeleton follows.
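
    For concreteness, here is one possible skeleton, lifted almost verbatim from your loop body (untested; it assumes geneList, humanSplit, ratSplit, and getGeneType are in scope, and the count helper is my own shorthand for the is.integer0 logic):

    myfisher <- function(i, j) {
        ## merge on the one-to-one gene mapping, as in the original loop
        mergedHumanData <- merge(geneList, humanSplit[[j]], by.x = "human", by.y = "humanGene")
        mergedRatData   <- merge(geneList, ratSplit[[i]],   by.x = "rat",   by.y = "ratGene")
        mergedHumanData <- mergedHumanData[, c(1, 2, 4, 5)]  #rearrange columns
        mergedRatData   <- mergedRatData[, c(2, 1, 4, 5)]
        mergedHumanRatData <- rbind(mergedHumanData, mergedRatData)
        agg <- aggregate(RNAtype ~ human + rat, data = mergedHumanRatData, FUN = getGeneType)
        HmRnTable <- table(agg$RNAtype)
        ## hypothetical helper: the count for a category, 0 if absent
        count <- function(nm) if (nm %in% names(HmRnTable)) HmRnTable[[nm]] else 0
        contingencyTable <- matrix(c(count("HnRn"), count("HnRy"),
                                     count("HyRn"), count("HyRy")), nrow = 2)
        fisher.test(contingencyTable)   # return the htest object un-massaged
    }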

  5. I'm inferring that you have a high number of comparisons to run (since each iteration really shouldn't take that long). @Floo0's comment about outer can work, as can expand.grid. Either way, you are comparing each element of list1 with each element of list2. If you start with:

    (eg <- expand.grid(i = 1:length(ratSplit),
                       j = 1:length(humanSplit)))
    ##    i j
    ## 1  1 1
    ## 2  2 1
    ## 3  3 1
    ## 4  4 1
    ## ...
    

    This gives us an easy data.frame on which we can use apply. Honestly, though, I like the elegance of ddply in this case, since it goes from a data.frame to a data.frame easily.

    library(plyr)
    ret <- ddply(eg, .(i, j), function(df) {
        with(myfisher(df$i, df$j),
             data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                        alt = alternative))
    })
    

    Note that I expressly did not include humanReplicateName or ratReplicateName, since those can be added to eg before the ddply (or merged on afterwards, referencing ret instead of eg) with:

    eg$ratReplicateName <- names(ratSplit[ eg$i ])
    eg$humanReplicateName <- names(humanSplit[ eg$j ])
    

    and the names will carry through to the output if you also list them in the grouping variables, e.g. .(i, j, ratReplicateName, humanReplicateName); they are constant within each (i, j) pair. Less to deal with inside the loop.

    Up until now, this will result in one data.frame that you can then save to a CSV.
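
    For example (a sketch; the merge pulls the replicate names back in by i and j):

    out <- merge(ret, eg, by = c("i", "j"))
    write.csv(out, "newData5.csv", row.names = FALSE)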

    I'll make one more recommendation that may be overkill, depending on how long this will be running. I've found at times that my long-running scripts get interrupted: perhaps I had to adjust something on the fly, I found a bug, or the computer had to reboot. There is no easy way to resume mid-stream, but I've worked out some techniques to mitigate that.

  6. Instead of ddply, use d_ply, which discards the return value. Instead of returning a (single-row) data.frame, immediately save that one row (with or without headers, your call) to its own file. Though the below code is not truly "atomic" and race conditions may therefore occur, it's robust enough to meet most needs:

    library(plyr)
    savedir <- './output'
    d_ply(eg, .(i, j), function(df) {
        fn <- file.path(savedir, sprintf('%s-%s.csv', df$i, df$j))
        if (! file.exists(fn)) {
            ret <- with(myfisher(df$i, df$j),
                        data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                   alt = alternative))
            write.table(ret, file = fn, sep = ",", append = FALSE, 
                        row.names = FALSE, col.names = TRUE)
        }
    })
    

    One benefit of this is that if/when the run is interrupted, the most you will have to do is look at the files within "./output/", delete any with 0 bytes, and re-run; only the missing pairs will be recomputed (a sketch follows). Oh, and it gets better.
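
    The cleanup and final stitching might look like this (untested sketch, assuming the './output' layout above):

    outFiles <- list.files('./output', pattern = '\\.csv$', full.names = TRUE)
    ## drop zero-byte files left by an interruption, then re-run the d_ply step
    file.remove(outFiles[file.size(outFiles) == 0])
    ## once every pair has completed, stitch the per-pair files into one table
    outFiles <- list.files('./output', pattern = '\\.csv$', full.names = TRUE)
    allRes <- do.call(rbind, lapply(outFiles, read.csv))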

  7. Parallelizing it. If you need to go this far (and some functions don't see as much improvement as others), you can instead use multiple cores on your system. You could do something like this:

    library(parallel)
    cl <- makeCluster(detectCores() - 1) # I like to keep one free
    clusterEvalQ(cl, {
        load('myTwoLists.rda') # with ratSplit and humanSplit variables
        source('mycode.R') # with myfisher(i,j)
        ## any libraries you may want/need to add, if you get more advanced
    })
    eg <- expand.grid(i = 1:length(ratSplit),
                      j = 1:length(humanSplit))
    eg$ratReplicateName <- names(ratSplit[ eg$i ])
    eg$humanReplicateName <- names(humanSplit[ eg$j ])
    savedir <- './output'
    clusterExport(cl, 'savedir')  # the workers need savedir as well
    ign <- parApply(cl, eg, 1, function(r) {
        ## apply() hands us the row as a character vector, so convert back
        i <- as.integer(r[['i']]) ; j <- as.integer(r[['j']])
        fn <- file.path(savedir, sprintf('%s-%s.csv', i, j))
        if (! file.exists(fn)) {
            ret <- with(myfisher(i, j),
                        data.frame(ratName = r[['ratReplicateName']], humanName = r[['humanReplicateName']],
                                   pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                                   alt = alternative))
            write.table(ret, file = fn, sep = ",", append = FALSE, 
                        row.names = FALSE, col.names = TRUE)
        }
    })
    stopCluster(cl)
    

    Notice that we aren't referencing the row as a data.frame anymore. I'd love it if Hadley's code worked natively with parallel. (For all I know, it does and I just haven't found it yet!) Edit: none of my past projects using parallel used ddply, and vice versa, so I have never played with ddply(..., .parallel = TRUE), which uses foreach; a rough sketch follows.
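
    For reference, that route would look roughly like this (untested; plyr's .parallel = TRUE dispatches via foreach, so a backend such as doParallel must be registered first):

    library(plyr)
    library(doParallel)
    registerDoParallel(cores = max(1, detectCores() - 1))  # keep one core free
    ret <- ddply(eg, .(i, j), function(df) {
        with(myfisher(df$i, df$j),
             data.frame(pv = p.value, ci1 = conf.int[1], ci2 = conf.int[2],
                        alt = alternative))
    }, .parallel = TRUE)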

Caveat Emptor: this code has not been tested in this context, though it is almost copy/pasted from working code. Because of edits, I may be off-by-one or missing a paren. Hope it helps!



Source: https://stackoverflow.com/questions/25223808/applying-sapply-or-other-apply-function-instead-of-nested-for-loop-for-lists-of
