Replace some column values in a data.frame based on another data.frame

悲哀的现实 2021-01-14 06:41

I have two data.frames (df1 and df2), and I would like to replace the letters in columns P1-P10 with the corresponding values of df1$V2, while keeping the first two colu

2 answers
  •  慢半拍i (original poster)
    2021-01-14 07:07

    Try some *pply magic:

    lookup <- tapply(df1$V2, df1$V1, unique)  # Named lookup table: letter -> value
    lookup.function <- function(x) as.numeric(lookup[as.character(x)])  # Maps a column of letters to values
    df4 <- data.frame(df2[, 1:2], apply(df2[, 3:12], 2, lookup.function))  # Keeps columns 1-2, translates columns 3-12
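
    To see what the lookup does on something small, here is a minimal sketch. The real df1 and df2 from the question are not shown, so the toy data below (column names id, name, P1, P2 and all values) is purely hypothetical; only the three steps mirror the code above.

```r
# Hypothetical toy data; the question's actual data frames are not shown.
df1 <- data.frame(V1 = c("A", "B", "C"), V2 = c(0.5, 1.5, 2.5))
df2 <- data.frame(id   = 1:3,
                  name = c("x", "y", "z"),
                  P1   = c("A", "C", "B"),
                  P2   = c("B", "B", "A"))

# Named lookup table: names are the letters, values are the numbers.
lookup <- tapply(df1$V2, df1$V1, unique)

# Translate one column of letters into the matching numbers.
lookup.function <- function(x) as.numeric(lookup[as.character(x)])

# Keep the first two columns and translate the remaining ones.
df4 <- data.frame(df2[, 1:2], apply(df2[, 3:4], 2, lookup.function))
print(df4)
```

    Indexing lookup by as.character(x) selects entries by name, so each letter is replaced by the value stored under that name in df1$V2.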
    

    Update:

    The *pply approach is much faster than merge in my tests, and the gap widens as the data grows. Check this out:

    num <- 1000
    df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
    # num rows: two id columns followed by num columns of random letters
    df2 <- data.frame(cbind(first = 1:num, second = 1:num, matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num, ncol = num)))
    
    
    # Time the *pply lookup
    start <- Sys.time()
    lookup <- tapply(df1$V2, df1$V1, unique)
    lookup.function <- function(x) as.numeric(lookup[as.character(x)])
    df4 <- data.frame(cbind(df2[, 1:2], data.frame(apply(df2[, 3:(num + 2)], 2, lookup.function))))
    (difftime(Sys.time(), start))
    
    
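    # Time the match()-based replacement (the "merge" variant); it must cover columns 3 to num + 2, like the *pply version
    start <- Sys.time()
    df4.merge <- "[<-"(df2, 3:(num + 2), value = df1[match(as.character(unlist(df2[3:(num + 2)])), as.character(df1[[1]])), 2])
    (difftime(Sys.time(), start))
    
    # Both approaches should agree on every translated column
    all(df4[, 3:(num + 2)] == df4.merge[, 3:(num + 2)])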
    

    For 3000 columns and rows, the *pply combination takes 4.3 s, whereas merge takes about 22 s on my slow Intel machine. It also scales well: for 4000 columns and rows, the respective times are 7.4 s and 118 s.
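
    If you want to repeat the comparison yourself, the following is a rough sketch that uses system.time() instead of Sys.time() differences. The size (num = 200), the seed, and the names cols, res.apply, res.match, t.apply and t.match are my own assumptions for illustration; timings will vary by machine, so none are quoted here.

```r
set.seed(1)   # assumption: fixed seed so the sketch is reproducible
num <- 200    # assumption: kept small so the sketch runs quickly

df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
df2 <- data.frame(first = 1:num, second = 1:num,
                  matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num))
cols <- 3:(num + 2)   # the letter columns to translate

# *pply lookup, as in the answer above
t.apply <- system.time({
  lookup <- tapply(df1$V2, df1$V1, unique)
  lookup.function <- function(x) as.numeric(lookup[as.character(x)])
  res.apply <- data.frame(df2[, 1:2], apply(df2[, cols], 2, lookup.function))
})

# match()-based replacement over the same columns
t.match <- system.time({
  res.match <- df2
  res.match[cols] <- df1$V2[match(as.character(unlist(df2[cols])),
                                  as.character(df1$V1))]
})

print(t.apply["elapsed"])
print(t.match["elapsed"])

# Both approaches should produce the same translated values.
print(all(res.apply[cols] == res.match[cols]))
```

    Raising num shows how the two timings diverge on your own hardware.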
