Optimization of an R loop taking 18 hours to run

Submitted by 人走茶凉 on 2019-12-12 00:15:20


I've got some R code that works and does what I want, but it takes a very long time to run. Here is an explanation of what the code does, followed by the code itself.

I've got a vector of 200000 rows containing street addresses (strings): data. Example:

> data[150000,]
"15 rue andre lalande residence marguerite yourcenar 91000 evry france" 

I also have a 131x2 matrix of strings holding 5-grams (word fragments) and the id of the bag of n-grams each one belongs to (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ... ]): list_ngrams

Example of list_ngrams :

  idSac ngram
1     4 stree
2     4 tree_ 
3     4 _stre
4     4 treet
5     5 avenu
6     5 _aven
7     5 venue
8     5 enue_

I also have a 200000x31 numeric matrix initialized with 0: idv_x_sacs

In total I have 131 5-grams and 31 bags of 5-grams.

I want to loop over the address strings and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the column whose name is the id of the bag that contains that 5-gram. Example:

Take this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" matches the bag ["resid","eside","dence",...], whose id is 5. So I put a 1 in the column named 5. The corresponding row of the idv_x_sacs matrix therefore looks like this:

> idv_x_sacs[150000,]
  4   5   6   8  10  12  13  15  17  18  22  26  29  34  35  36  42  43  45  46  47  48  52  55  81  82 108 114 119 122 123 
  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
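
To make the rule concrete, here is a minimal sketch of the check for this single address. The vector bag5 is a hypothetical stand-in that holds only the three 5-grams quoted above (the real bag has more entries), and fixed = TRUE assumes the 5-grams should be matched as literal substrings rather than as regular expressions.

```r
# One address and a hypothetical, truncated version of bag 5
address <- "15 rue andre lalande residence marguerite yourcenar 91000 evry france"
bag5 <- c("resid", "eside", "dence")

# Test each 5-gram of the bag against the address (literal substring match)
hits <- vapply(bag5, grepl, logical(1), x = address, fixed = TRUE)

# The column for bag 5 gets a 1 as soon as any of its 5-grams matches
flag5 <- as.integer(any(hits))
flag5
```

Here all three 5-grams occur inside "residence", so flag5 is 1, which is the value shown in column 5 above.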

Here is the code that does this:

library(sqldf)
idv_x_sacs <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_sacs) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac

for (i in 1:nrow(idv_x_sacs)) {
  for (ngram in list_ngrams$ngram) {
    # only flag the bag when address i contains this 5-gram
    if (grepl(ngram, data[i, ])) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_sacs[i, as.character(idSac)] <- 1
    }
  }
}

The code does exactly what I want, but it takes about 18 hours, which is far too long. I tried to rewrite it in C++ using the Rcpp library, but I ran into many problems. I also tried to rewrite it using apply, but I couldn't get it to work. Here is what I did:


I need some help rewriting my loop with apply or some other method that runs faster than the current one. Thank you very much.


Check this one and run the simple example step by step to see how it works. My n-grams don't make much sense, but it will work with actual n-grams as well.


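 # packages used below: dplyr for the pipeline, dcast assumed to come from reshape2
 library(dplyr)
 library(reshape2)

 # your example dataset
 dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = FALSE)
 dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
                     ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = FALSE)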

 # sentence dataset

                       sen
    1 this is a good thing
    2          this is bad

 #ngrams dataset

  id_ngr ngr
1      2  th
2      2  go
3      2  tt
4      3 drf
5      3 ytu
6      3 bad

 # create table of matches
 expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
   data.frame() %>%
   rename(sen = Var1,
          id_ngr = Var2) %>%
   left_join(dt_ngr, by = "id_ngr") %>%
   group_by(sen, id_ngr,ngr) %>%
   do(data.frame(match = grepl(.$ngr,.$sen))) %>%
   group_by(sen,id_ngr) %>%
   summarise(sum_success = sum(match)) %>%
   mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full

Source: local data frame [4 x 4]
Groups: sen

                   sen id_ngr sum_success match
1 this is a good thing      2           2     1
2 this is a good thing      3           0     0
3          this is bad      2           1     1
4          this is bad      3           1     1

 # reshape table
 dt_full %>% dcast(., sen~id_ngr, value.var = "match")
                   sen 2 3
1 this is a good thing 1 0
2          this is bad 1 1
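
As an alternative sketch for the full 200000-address dataset, the work can also be organised around the n-grams instead of the addresses: one vectorised grepl call per n-gram tests every address at once, so the loop runs once per n-gram (131 times on the real data) with no SQL lookup inside it. This is a base-R variant rather than the pipeline above, and it has not been timed on the real data. flag_bags is a hypothetical helper name, and fixed = TRUE assumes the n-grams should be matched as literal substrings.

```r
# Toy data, same as the example above
dt_sen <- data.frame(sen = c("this is a good thing", "this is bad"),
                     stringsAsFactors = FALSE)
dt_ngr <- data.frame(id_ngr = c(2, 2, 2, 3, 3, 3),
                     ngr = c("th", "go", "tt", "drf", "ytu", "bad"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: returns a 0/1 matrix with one row per sentence
# and one column per bag id
flag_bags <- function(sentences, ngrams, bag_ids) {
  ids <- sort(unique(bag_ids))
  out <- matrix(0L, nrow = length(sentences), ncol = length(ids),
                dimnames = list(NULL, as.character(ids)))
  for (k in seq_along(ngrams)) {
    # grepl is vectorised over sentences: one call tests all of them
    hit <- grepl(ngrams[k], sentences, fixed = TRUE)
    # mark the bag of n-gram k for every sentence that contains it
    out[hit, as.character(bag_ids[k])] <- 1L
  }
  out
}

res <- flag_bags(dt_sen$sen, dt_ngr$ngr, dt_ngr$id_ngr)
res
```

On the toy data, res has columns "2" and "3" with rows 1 0 and 1 1, the same flags as the reshaped table above. With the question's objects the call would presumably be flag_bags(data[, 1], list_ngrams$ngram, list_ngrams$idSac), assuming data is a one-column table of addresses.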

