Calculate word co-occurrence matrix in R

既然无缘 2021-01-22 08:12

I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences:

dat <- as.data.frame("The boy is tall.", header =


        
1 Answer
  •  离开以前
    2021-01-22 08:36

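    library(tm)

    # Build a one-column data frame with one sentence per row
    dat      <- as.data.frame("The boy is tall.", stringsAsFactors = FALSE)
    dat[2,1] <- "The girl is short."
    dat[3,1] <- "The tall boy and the short girl are friends."
    
    # Recent tm versions expect DataframeSource() input to have doc_id and
    # text columns, so the sentences are passed as a plain vector instead
    ds  <- VCorpus(VectorSource(dat[, 1]))
    dtm <- DocumentTermMatrix(ds, control = list(wordLengths = c(1, Inf)))
    
    X         <- as.matrix(dtm)  # full dense matrix; inspect() is meant for printing
    out       <- crossprod(X)    # Same as: t(X) %*% X
    diag(out) <- 0               # remove own-word occurrences
    out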
    
            Terms
    Terms    boy friend girl short tall the
      boy      0      1    1     1    2   2
      friend   1      0    1     1    1   1
      girl     1      1    0     2    1   2
      short    1      1    2     0    1   2
      tall     2      1    1     1    0   2
      the      2      1    2     2    2   0
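
    To see what `crossprod` is doing without any text-mining package, here is a minimal base R sketch. The matrix `X` below is typed in by hand as a hypothetical document-term matrix for the three sentences (rows are documents, columns are a few chosen terms); it is not output from `tm`. It also shows a binary variant, which counts the number of documents in which two words appear together instead of multiplying raw counts.

```r
# Hypothetical document-term matrix, typed in by hand for illustration:
# rows = documents, columns = terms, cells = raw counts.
X <- matrix(
  c(1, 0, 1, 1,   # doc 1: "The boy is tall."
    0, 1, 0, 1,   # doc 2: "The girl is short."
    1, 1, 1, 2),  # doc 3: "The tall boy and the short girl are friends."
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("boy", "girl", "tall", "the"))
)

# Count-weighted co-occurrence: entry [i, j] is the sum over documents of
# count(i) * count(j).
cooc_counts <- crossprod(X)      # same as t(X) %*% X
diag(cooc_counts) <- 0           # drop each word's pairing with itself

# Document-level co-occurrence: reduce counts to presence/absence first,
# so entry [i, j] is the number of documents containing both words.
B <- (X > 0) * 1
cooc_docs <- crossprod(B)
diag(cooc_docs) <- 0

cooc_counts["boy", "the"]  # 3, because "the" occurs twice in doc 3
cooc_docs["boy", "the"]    # 2, because both words appear in docs 1 and 3
```

    Which variant is appropriate depends on whether repeated words within one sentence should count more than once.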
    

    You may also want to clean the corpus and remove stop words like "the" before building the document-term matrix, e.g.

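    ds <- tm_map(ds, content_transformer(tolower))  # so "The" matches "the"
    ds <- tm_map(ds, removePunctuation)
    # Remove stop words before stemming so they still match the lists
    ds <- tm_map(ds, removeWords, c("the", stopwords("english")))
    ds <- tm_map(ds, removeWords, c("the", stopwords("spanish")))
    ds <- tm_map(ds, stemDocument)                  # requires the SnowballC package
    ds <- tm_map(ds, stripWhitespace)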
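
    As a rough illustration of what these cleaning steps accomplish, the following base R sketch lowercases the sentences, strips punctuation, drops a small hand-picked stop-word list, and builds the co-occurrence matrix from the remaining tokens. The `stop_words` vector and the helper name `tokenize` are assumptions made for this example; the list returned by `stopwords("english")` in `tm` is much longer, and no stemming is done here.

```r
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Hypothetical, deliberately tiny stop-word list for this example only.
stop_words <- c("the", "is", "and", "are")

# Illustrative helper: lowercase, remove punctuation, split on whitespace,
# then drop stop words.
tokenize <- function(s) {
  s <- tolower(s)
  s <- gsub("[[:punct:]]", "", s)
  tokens <- strsplit(s, "\\s+")[[1]]
  tokens[!tokens %in% stop_words]
}

tokens <- lapply(sentences, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per sentence, one column per word.
X <- t(vapply(tokens,
              function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
              integer(length(vocab))))
colnames(X) <- vocab

cooc <- crossprod(X)   # word-by-word co-occurrence counts
diag(cooc) <- 0        # ignore each word's pairing with itself
cooc
```

    Because the stop words are removed before the matrix is built, they never appear as rows or columns of `cooc`.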
    
