Split up ngrams in document-feature matrix (quanteda)

问题

I was wonderig if it's possible to split up ngram-features in a document-feature matrix (dfm) in such a way that e.g. a bigram results in two separate unigrams?

head(dfm, n = 3, nfeature = 4)

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So, the above dfm would result in something like this:

head(dfm, n = 3, nfeature = 4)

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quiet common in German but not in English ("emission reduction").

Thank you in advance!

EDIT: The following can be used as reproducible example.

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

head(eg.dfm)

回答1:

I don't know if the best approach (it might use a lot of RAM since it turns the sparse dfm to a data.frame/matrix), but it should work :

# turn the dft into a matrix (transposing it)
DF <- as.data.frame(eg.dfm)
MX <- t(DF)
# split the current column names by '_'
colsSplit <- strsplit(colnames(DF),'_')
# replicate the rows of the matrix and give them the new split row names
MX <-MX[unlist(lapply(1:length(colsSplit),function(idx) rep(idx,length(colsSplit[[idx]])))),]
rownames(MX) <- unlist(colsSplit)
# aggregate the matrix rows having the same name and transpose again
MX2 <- t(do.call(rbind,by(MX,rownames(MX),colSums)))
# turn the matrix into a dfm
eg.dfm.res <- as.dfm(MX2)

Result :

> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1

来源：https://stackoverflow.com/questions/44158882/split-up-ngrams-in-document-feature-matrix-quanteda

标签

quanteda