Running “apply” command on a very large data frame

Question

I have a tibble in R with dimensions 15,000,000 x 140. It is about 6 GB in size.

For each row, I want to check whether any of columns 11-40 starts with one of a specific list of prefixes. The result should be a vector of 1s and 0s of length 15,000,000.

I can do this using the following:

subResult <- apply(rawData[,11:40], c(1,2), function(x){substring(x,1,3) %in% c("295", "296", "297", "298", "299")})

result <- apply(subResult, 1, sum)

The problem is that this is far too slow: the first line of code alone would take over a day to run.

Is there any way to do this faster, perhaps directly through dplyr or data.table?

Thank you!

Here's a sampling of the data trimmed to just columns 11-40.

> head(rawData)
 # A tibble: 6 x 30
   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13
   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 39402 39451 3fv3i 19593 fk20 14p4  59304  329fj2 NA    NA    NA    NA    NA
 2 39422 f203ff vmio2  vo2493  19149 59833 13404 394034 43920  349304   59302 1934 34834
 3 3432f32 fe493  43943 H2344 53049  V602  3124  K148 K13  NA    NA    NA    NA
 # ... with 17 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
 #   X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>,
 #   X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>

Answer 1:

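Based on the description, this can be done with the tidyverse:

library(tidyverse)
rawData %>%
   select(11:40) %>% # select the columns of interest
   # convert each column to logical: TRUE where the first 3 characters match
   # (a purrr-style lambda is used here because funs() is deprecated)
   mutate_all(~ substring(., 1, 3) %in% c("295", "296", "297", "298", "299")) %>%
   reduce(`+`) %>% # add the logical columns element-wise to get the row-wise sum
   mutate(rawData, newcol = .) # assign the sum as a new column of the original data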

Or with data.table: convert the 'data.frame' to a 'data.table' by reference with setDT(rawData), specify the columns of interest in .SDcols, loop over those columns with lapply(), convert each one to a logical vector using the OP's condition, add the vectors element-wise with Reduce() to get the sum for each row, and assign (:=) the result to 'newCol'.

library(data.table)
setDT(rawData)[, newCol := Reduce('+', lapply(.SD, function(x) 
      substring(x, 1, 3) %chin% c("295", "296", "297", "298", "299"))), 
     .SDcols = 11:40]
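The lapply()/Reduce() pattern described above does not depend on data.table itself. As a minimal base-R sketch on a tiny made-up data frame (the names `toy`, `prefixes` and `cols` are illustrative assumptions, not part of the original data):

```r
# Toy stand-in for rawData; X1 and X2 play the role of columns 11:40.
toy <- data.frame(
  id = 1:3,
  X1 = c("29512", "39402", "K148"),
  X2 = c("2967", "29999", NA),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")
cols <- c("X1", "X2")

# lapply() yields one logical vector per column; Reduce(`+`) adds them
# element-wise, so no intermediate matrix has to be built.
toy$newCol <- Reduce(`+`, lapply(toy[cols], function(x) {
  substring(x, 1, 3) %in% prefixes
}))

print(toy$newCol)
```

NA entries never match, because `NA %in% prefixes` is FALSE, so they contribute 0 to the sum.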

Answer 2:

My comments:

apply() converts your data to a matrix before doing anything else.
A data frame is above all a list (of columns), not a matrix.
substring() is a vectorized function, and so is %in%.

So, I would do:

sapply(rawData[11:40], function(var) {
  substring(var, 1, 3) %in% c("295", "296", "297", "298", "299")
})

and then use rowSums() instead of apply(subResult, 1, sum).
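Put together, the two steps might look like the sketch below. It runs on a small made-up data frame (`toy` and `prefixes` are illustrative names, not from the question) and ends with the 1/0 flag the question asks for:

```r
# Toy stand-in for rawData[11:40]: three character columns.
toy <- data.frame(
  X1 = c("29512", "39402", NA),
  X2 = c("2967", "29999", "1234"),
  X3 = c(NA, "K148", "1298"),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")

# One vectorized pass per column; sapply() binds the logical vectors
# into a matrix with one row per data row.
hits <- sapply(toy, function(var) substring(var, 1, 3) %in% prefixes)

# rowSums() counts the TRUE values in each row.
n_hits <- rowSums(hits)

# 1 if any column matched, 0 otherwise.
flag <- as.integer(n_hits > 0)

print(n_hits)
print(flag)
```

On the full 15,000,000-row data the logical matrix `hits` is itself large (30 columns wide), so memory use is worth checking before running this on everything at once.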

Answer 3:

Try the Rcpp package.

Here is a simple C++ function that takes two string vectors and checks whether the first 3 characters of each element of the first vector are equal to each element of the second. It returns a logical matrix of size length(first vector) x length(second vector).

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalMatrix IndicatorMatrix(std::vector<std::string> target, std::vector<std::string> tocheck) {

  int nrows = target.size();
  int ncols = tocheck.size();

  // one row per element of target, one column per string in tocheck
  LogicalMatrix ind(nrows, ncols);

  for(int r=0; r<nrows; r++) {
    // the 3-character prefix depends only on the row, so compute it once
    std::string prefix = target[r].substr(0,3);
    for(int c=0; c<ncols; c++) {
      ind(r,c) = (prefix == tocheck[c]);
    }
  }

  return ind;

}

After that, you can source this file into R and call IndicatorMatrix as if it were an ordinary R function.

library(Rcpp)
sourceCpp("C:/Users/Desktop/indicatorMatrix.cpp")

# 15 million identical strings, repeated across 8 columns
x <- rep("123456", 15000000)
df <- data.frame(x,x,x,x,x,x,x,x, stringsAsFactors=FALSE)
y <- c("123", "124", "345", "231", "675", "344", "222")

# time one IndicatorMatrix() call per column
t1 <- Sys.time()
out <- lapply(seq_along(df), function(col) {
  IndicatorMatrix(df[[col]], y)
})
t2 <- Sys.time()
t2-t1

The program searched for 7 three-character strings in an 8-column data frame with 15 million rows in about 100 seconds. So this could be the right direction for you.
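The list `out` above still holds one indicator matrix per column, while the question asks for a single value per row. One way to collapse it is sketched below. So that the sketch runs without compiling anything, it uses a pure-R stand-in, `indicator_matrix_r()`, which is a hypothetical helper written for this illustration and not part of Rcpp. The toy data contains no NA values, since the stand-in and the C++ function may treat NA differently.

```r
# Hypothetical pure-R stand-in for IndicatorMatrix(): returns a
# length(target) x length(tocheck) logical matrix of prefix matches.
indicator_matrix_r <- function(target, tocheck) {
  outer(substring(target, 1, 3), tocheck, "==")
}

# Toy data: two columns, three rows.
toy <- data.frame(
  a = c("123456", "999999", "345000"),
  b = c("124000", "888888", "777777"),
  stringsAsFactors = FALSE
)
y <- c("123", "124", "345")

# One indicator matrix per column, as in the timing code.
out <- lapply(seq_along(toy), function(col) indicator_matrix_r(toy[[col]], y))

# rowSums() counts the matches within one column's matrix;
# Reduce(`+`) then adds those counts across columns.
n_hits <- Reduce(`+`, lapply(out, rowSums))

# 1 if any column of the row matched any prefix, 0 otherwise.
flag <- as.integer(n_hits > 0)

print(flag)
```

With the compiled function available, replacing `indicator_matrix_r` by `IndicatorMatrix` should leave the last two steps unchanged.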

Source: https://stackoverflow.com/questions/49645059/running-apply-command-on-a-very-large-data-frame

Tags

dataframe

parallel-processing

tibble