I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences:
library(tm)
library(dplyr)
dat <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")
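# VectorSource works across tm versions; recent DataframeSource() expects doc_id and text columns
ds <- Corpus(VectorSource(dat[, 1]))
dtm <- DocumentTermMatrix(ds, control = list(wordLengths = c(1, Inf)))
X <- as.matrix(dtm) # full document-term matrix; inspect() is meant for printing
out <- crossprod(X) # Same as: t(X) %*% X
diag(out) <- 0 # remove each word's co-occurrence with itself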
out
        Terms
Terms    boy friend girl short tall the
  boy      0      1    1     1    2   2
  friend   1      0    1     1    1   1
  girl     1      1    0     2    1   2
  short    1      1    2     0    1   2
  tall     2      1    1     1    0   2
  the      2      1    2     2    2   0
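To make the crossprod step less of a black box, here is a minimal base-R sketch of the same idea without tm. The tokenization (lowercase, strip punctuation, split on whitespace) and the names `tokens`, `vocab`, `cooc_counts` and `cooc_docs` are my own choices for illustration, not anything tm does internally. It also shows one thing to watch for: crossprod on raw counts multiplies term frequencies, so a word that appears twice in a sentence contributes twice. If you want the number of sentences in which two words appear together, binarize the matrix first.

```r
# Toy sentences from the question
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Lowercase, strip punctuation, split on whitespace (illustrative tokenizer)
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+")

# Document-term count matrix: one row per sentence, one column per word
vocab <- sort(unique(unlist(tokens)))
X <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Frequency-weighted co-occurrence: products of counts, summed over sentences
cooc_counts <- crossprod(X)
diag(cooc_counts) <- 0

# Sentence-level co-occurrence: number of sentences containing both words
B <- (X > 0) * 1
cooc_docs <- crossprod(B)
diag(cooc_docs) <- 0
```

In this toy data "the" occurs twice in the third sentence, so `cooc_counts["the", "boy"]` and `cooc_docs["the", "boy"]` differ. Which one you want depends on whether repeated words within a sentence should count more than once.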
You may also want to remove stop words like "the" before building the document-term matrix, e.g.
ds <- tm_map(ds, stripWhitespace)
ds <- tm_map(ds, removePunctuation)
ds <- tm_map(ds, stemDocument)
ds <- tm_map(ds, removeWords, c("the", stopwords("english")))
ds <- tm_map(ds, removeWords, c("the", stopwords("spanish")))
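Note that cleaning the corpus does not change a matrix you have already computed, so the document-term matrix has to be rebuilt afterwards. A minimal end-to-end sketch, assuming the tm package is installed: I use VCorpus(VectorSource(...)) here instead of DataframeSource, lowercase before removing stop words so that "The" is matched, and skip stemming because stemDocument additionally needs the SnowballC package. The object names are just for illustration.

```r
library(tm)

sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

ds <- VCorpus(VectorSource(sentences))

# Clean first: lowercase, drop punctuation and English stop words, tidy spaces
ds <- tm_map(ds, content_transformer(tolower))
ds <- tm_map(ds, removePunctuation)
ds <- tm_map(ds, removeWords, stopwords("english"))
ds <- tm_map(ds, stripWhitespace)

# Then build the matrix from the cleaned corpus
dtm <- DocumentTermMatrix(ds, control = list(wordLengths = c(1, Inf)))
X <- as.matrix(dtm)

cooc <- crossprod(X)  # same as t(X) %*% X
diag(cooc) <- 0       # ignore each word's co-occurrence with itself
cooc
```

With the stop words gone, only the content words of the three sentences remain as rows and columns of `cooc`.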