Speeding up the processing of large data frames in R

joran

The following runs in under 7 seconds on my machine, for all the bigrams:

library(dplyr)

# join the two tables on the shared word columns; the duplicated mi column
# gets the default suffixes, so mi.x comes from token.df[[2]] and mi.y from token.df[[3]]
res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1','w2'))
# keep a (w1, w2) pair only when every mi.y in its group is below mi.x
res <- group_by(res, w1, w2)
bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)

There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done using data.table or directly in SQL. You just need to switch to using joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge and then aggregate in base R were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr or directly in a SQL database.)
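For concreteness, here's a minimal sketch of that merge + aggregate route, assuming token.df is the list of data frames from the question, with columns w1, w2 and mi in each element (untested against your data):

# merge adds .x/.y suffixes to the duplicated mi column, mirroring inner_join above
m <- merge(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
# flag rows where the token.df[[3]] mi is below the token.df[[2]] mi
m$keep <- m$mi.y < m$mi.x
# require the flag to hold for every row in each (w1, w2) group
agg <- aggregate(keep ~ w1 + w2, data = m, FUN = all)
bigrams_base <- agg[agg$keep, c('w1', 'w2')]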

Indeed, this:

library(data.table)

# convert to data.table and key both tables by the join columns
dt2 <- setkey(data.table(token.df[[2]]), w1, w2)
dt3 <- setkey(data.table(token.df[[3]]), w1, w2)
# join dt2 into dt3; the duplicated mi column coming from dt2 is named i.mi
dt_tmp <- dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1','w2')][(k)]

is faster still (~2x). I'm not sure I've squeezed all the speed I could out of either package, to be honest.
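If you want to compare the two approaches on your own data, here's a rough sketch using base R's system.time, assuming the objects defined above are already in your workspace:

# time the dplyr pipeline
system.time({
  res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
  bigrams <- filter(summarise(group_by(res, w1, w2), keep = all(mi.y < mi.x)), keep)
})

# time the data.table pipeline (dt2 and dt3 keyed as above)
system.time(
  dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1','w2')][(k)]
)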


(Edit from Rick. Attempted as a comment, but the syntax was getting messed up.)
If using data.table, this should be even faster, since data.table has a by-without-by feature (see ?data.table for more info):

# by-without-by: j is evaluated once per row of dt2, i.e. per (w1, w2) join key
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), allow.cartesian = TRUE][(k)]

Note that when joining data.tables you can prefix a column name with i. to refer to that column from the data.table given in the i= argument.
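One caveat: newer data.table releases (1.9.4 and later, if I remember the change correctly) dropped the implicit by-without-by behaviour, so on a current installation the per-group version of that call needs an explicit by = .EACHI, along these lines:

# by = .EACHI evaluates j once per row of dt2, i.e. per (w1, w2) join key
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), by = .EACHI, allow.cartesian = TRUE][(k)]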
