问题
I am looking for ways to speed up my code. I am looking into the apply
/ply
methods as well as data.table
. Unfortunately, I am running into problems.
Here is a small sample data:
ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA,NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
data <- data.frame(col1 = ids1, col2 = ids2,
col3 = chars1, col4 = chars2,
stringsAsFactors = FALSE)
Here is a solution using loops:
library("plyr")
cols_to_fix <- c("col3","col4")
for (i in 1:length(cols_to_fix)) {
data[,cols_to_fix[i]] <- gsub("_", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- gsub(" ", "", data[,cols_to_fix[i]])
data[,cols_to_fix[i]] <- ifelse(data[,cols_to_fix[i]]=="n/a", NA, data[,cols_to_fix[i]])
}
I initially looked at ddply
, but some methods I want to use only take vectors. Hence, I cannot figure out how to do ddply
across just certain columns one-by-one.
Also, I have been looking at laply
, but I want to return the original data.frame
with the changes. Can anyone help me? Thank you.
Based on the suggestions from earlier, here is what I tried to use from the plyr
package.
Option 1:
data[,cols_to_fix] <- aaply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text",.drop = FALSE)
Option 2:
data[,cols_to_fix] <- alply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
Option 3:
data[,cols_to_fix] <- adply(data[,cols_to_fix],2, function(x){
x <- gsub("_", "", x,perl=TRUE)
x <- gsub(" ", "", x,perl=TRUE)
x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
None of these are giving me the correct answer.
apply
works great, but my data is very large and the progress bars from plyr
package would be a very nice. Thanks again.
回答1:
Here's a data.table
solution using set
.
require(data.table)
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
DT
# col1 col2 col3 col4
# 1: 1 1 aa vv
# 2: 1 2 bb ww
# 3: 1 3 cc xx
# 4: 1 4 dd yy
# 5: 2 1 ee zz
# 6: 2 2 NA NA
# 7: 2 3 NA NA
# 8: 2 4 NA NA
First line reads: set in DT for all i(=NULL), and column=j the value gsub(..).
Second line reads: set in DT where i(=condn) and column=j with value NA_character_.
Note: Using PCRE (perl=TRUE
) has nice speed-up, especially on bigger vectors.
回答2:
Here is a data.table
solution, should be faster if your table is large.
The concept of := is an "update" of the columns. I believe that because of this you aren't copying the table internally again as a "normal" dataframe solution would.
require(data.table)
DT <- data.table(data)
fxn = function(col) {
col = gsub("[ _]", "", col, perl = TRUE)
col[which(col == "n/a")] <- NA_character_
col
}
cols = c("col3", "col4");
# lapply your function
DT[, (cols) := lapply(.SD, fxn), .SDcols = cols]
print(DT)
回答3:
No need for loops (for
or *ply
):
tmp <- gsub("[_ ]", "", as.matrix(data[,cols_to_fix]), perl=TRUE)
tmp[tmp=="n/a"] <- NA
data[,cols_to_fix] <- tmp
Benchmarks
I only benchmark Arun's data.table solution and my matrix solution. I assume that many columns need to be fixed.
Benchmark code:
options(stringsAsFactors=FALSE)
set.seed(45)
K <- 1000; N <- 1e5
foo <- function(K) paste(sample(c(letters, "_", " "), 8, replace=TRUE), collapse="")
bar <- function(K) replicate(K, foo(), simplify=TRUE)
data <- data.frame(id1=sample(5, K, TRUE),
id2=sample(5, K, TRUE)
)
data <- cbind(data, matrix(sample(bar(K), N, TRUE), ncol=N/K))
cols_to_fix <- as.character(seq_len(N/K))
library(data.table)
benchfun <- function() {
time1 <- system.time({
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
})
data2 <- data
time2 <- system.time({
tmp <- gsub("[_ ]", "", as.matrix(data2[,cols_to_fix]), perl=TRUE)
tmp[tmp=="n/a"] <- NA
data2[,cols_to_fix] <- tmp
})
list(identical= identical(as.data.frame(DT), data2),
data.table_timing= time1[[3]],
matrix_timing=time2[[3]])
}
replicate(3, benchfun())
Benchmark results:
#100 columns to fix, nrow=1e5
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 6.001 5.571 5.602
#matrix_timing 17.906 17.21 18.343
#1000 columns to fix, nrow=1e4
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 4.509 4.574 4.857
#matrix_timing 13.604 14.219 13.234
#1000 columns to fix, nrow=100
# [,1] [,2] [,3]
#identical TRUE TRUE TRUE
#data.table_timing 0.052 0.052 0.055
#matrix_timing 0.134 0.128 0.127
#100 columns to fix, nrow=1e5 and including
#data1 <- as.data.frame(DT) in the timing
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#identical TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#data.table_timing 5.642 5.58 5.762 5.382 5.419 5.633 5.508 5.578 5.634 5.397
#data.table_returnDF_timing 5.973 5.808 5.817 5.705 5.736 5.841 5.759 5.833 5.689 5.669
#matrix_timing 20.89 20.3 19.988 20.271 19.177 19.676 20.836 20.098 20.005 19.409
data.table is faster only by a factor of three. This advantage could probably be even smaller, if we decide to change the data structure (as the data.table solution does) and keep it a matrix.
回答4:
I think you can do this with regular old apply
, which will call your cleanup function on each column (margin=2):
fxn = function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
return(col)
}
data[,cols_to_fix] <- apply(data[,cols_to_fix], 2, fxn)
data
# col1 col2 col3 col4
# 1 1 1 aa vv
# 2 1 2 bb ww
# 3 1 3 cc xx
# 4 1 4 dd yy
# 5 2 1 ee zz
# 6 2 2 <NA> <NA>
# 7 2 3 <NA> <NA>
# 8 2 4 <NA> <NA>
Edit: it sounds like you're requiring the use of the plyr
package. I'm not an expert in plyr
, but this seemed to work:
library(plyr)
data[,cols_to_fix] <- t(laply(data[,cols_to_fix], fxn))
回答5:
Here's a benchmark of all the different answers:
First, all the answers as separate functions:
1) Arun's
arun <- function(data, cols_to_fix) {
DT <- data.table(data)
for (j in cols_to_fix) {
set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
return(DT)
}
2) Martin's
martin <- function(data, cols) {
DT <- data.table(data)
colfun = function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
}
DT[, (cols) := lapply(.SD, colfun), .SDcols = cols]
return(DT)
}
3) Roland's
roland <- function(data, cols_to_fix) {
tmp <- gsub("[_ ]", "", as.matrix(data[,cols_to_fix]))
tmp[tmp=="n/a"] <- NA
data[,cols_to_fix] <- tmp
return(data)
}
4) BrodieG's
brodieg <- function(data, cols_to_fix) {
fix_fun <- function(x) gsub("(_| )", "", ifelse(x == "n/a", NA_character_, x))
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, fix_fun)
return(data)
}
5) Josilber's
josilber <- function(data, cols_to_fix) {
colfun2 <- function(col) {
col <- gsub("_", "", col)
col <- gsub(" ", "", col)
col <- ifelse(col=="n/a", NA, col)
return(col)
}
data[,cols_to_fix] <- apply(data[,cols_to_fix], 2, colfun2)
return(data)
}
2) benchmarking function:
We'll run this function 3 times and take the minimum of the run (removes cache effects) to be the runtime:
bench <- function(data, cols_to_fix) {
ans <- c(
system.time(arun(data, cols_to_fix))["elapsed"],
system.time(martin(data, cols_to_fix))["elapsed"],
system.time(roland(data, cols_to_fix))["elapsed"],
system.time(brodieg(data, cols_to_fix))["elapsed"],
system.time(josilber(data, cols_to_fix))["elapsed"]
)
}
3) On (slightly) big data with just 2 cols to fix (like in OP's example here):
require(data.table)
set.seed(45)
K <- 1000; N <- 1e5
foo <- function(K) paste(sample(c(letters, "_", " "), 8, replace=TRUE), collapse="")
bar <- function(K) replicate(K, foo(), simplify=TRUE)
data <- data.frame(id1=sample(5, N, TRUE),
id2=sample(5, N, TRUE),
col3=sample(bar(K), N, TRUE),
col4=sample(bar(K), N, TRUE)
)
rown <- c("arun", "martin", "roland", "brodieg", "josilber")
coln <- paste("run", 1:3, sep="")
cols_to_fix <- c("col3","col4")
ans <- matrix(0L, nrow=5L, ncol=3L)
for (i in 1:3) {
print(i)
ans[, i] <- bench(data, cols_to_fix)
}
rownames(ans) <- rown
colnames(ans) <- coln
# run1 run2 run3
# arun 0.149 0.140 0.142
# martin 0.643 0.629 0.621
# roland 1.741 1.708 1.761
# brodieg 1.926 1.919 1.899
# josilber 2.067 2.041 2.162
回答6:
The apply
version is the way to go. Looks like @josilber came up with the same answer, but this one is slightly different (note regexp).
fix_fun <- function(x) gsub("(_| )", "", ifelse(x == "n/a", NA_character_, x))
data[, cols_to_fix] <- apply(data[, cols_to_fix], 2, fix_fun)
More importantly, generally you want to use ddply
and data.table
when you want to do split-apply-combine analysis. In this case, all your data belongs to the same group (there aren't any subgroups you're doing anything different with), so you might as well use apply
.
The 2
at the center of the apply
statement means we want to subset the input by the 2nd dimension, and pass the result (in this case vectors, each representing a column from your data frame in cols_to_fix
) to the function that does the work. apply
then re-assembles the result, and we assign it back to the columns in cols_to_fix
. If we had used 1
instead, apply
would have passed the rows in our data frame to the function. Here is the result:
data
# col1 col2 col3 col4
# 1 1 1 aa vv
# 2 1 2 bb ww
# 3 1 3 cc xx
# 4 1 4 dd yy
# 5 2 1 ee zz
# 6 2 2 <NA> <NA>
# 7 2 3 <NA> <NA>
# 8 2 4 <NA> <NA>
If you do have sub-groups, then I recommend you use data.table
. Once you get used to the syntax it's hard to beat for convenience and speed. It will also do efficient joins across data sets.
来源:https://stackoverflow.com/questions/21056297/r-plyr-data-table-apply-certain-columns-of-data-frame