Finding 2 & 3 word Phrases Using R TM Package

Asked 2020-11-28 04:26

I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for it that I do not know of).

7 Answers
  • 2020-11-28 04:57

    This is part 5 of the FAQ of the tm package:

    5. Can I use bigrams instead of single tokens in a term-document matrix?

    Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

      library("RWeka")
      library("tm")
    
      data("crude")
    
      BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
      tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
    
      inspect(tdm[340:345,1:10])
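
    To pull the most frequent bigrams out of that matrix, findFreqTerms() and row sums work directly on the TermDocumentMatrix. A minimal sketch using the tdm built above (the name freq is just illustrative):

      freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # bigram counts across all documents
      head(freq, 20)                                            # the 20 most frequent bigrams
      findFreqTerms(tdm, lowfreq = 10)                          # or: bigrams appearing at least 10 times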
    
  • 2020-11-28 04:57

    Try this code.

    library(tm)
    library(SnowballC)
    library(class)
    library(wordcloud)
    
    # read the data and build a corpus from the column that holds your text
    keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?"))
    keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need"))
    
    # standard cleanup
    keywords_doc <- tm_map(keywords_doc, removeNumbers)
    keywords_doc <- tm_map(keywords_doc, content_transformer(tolower))  # base functions must be wrapped for tm >= 0.6
    keywords_doc <- tm_map(keywords_doc, stripWhitespace)
    keywords_doc <- tm_map(keywords_doc, removePunctuation)
    keywords_doc <- tm_map(keywords_doc, stemDocument)
    

    This is the bigram / trigram section that you can use (a trigram variant is sketched below the full script):

    # bigram tokenizer based on NLP::ngrams (the NLP package is attached along with tm)
    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
    
    # create the term-document matrix of bigrams
    keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
    
    # remove sparse terms 
    keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)
    
    # Frequency of the words appearing
    keyword.freq <- rowSums(as.matrix(keywords_naremoval))
    subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20)
    frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq) 
    
    # Sorting of the words
    frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
    frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
    frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
    
    # Printing of the words
    wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
    

    Hope this helps. This is a complete script that you can adapt.
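
    If you want trigrams as well, the only change is the window size passed to ngrams(). A sketch based on the same tokenizer pattern (TrigramTokenizer and keywords_matrix3 are just illustrative names):

    TrigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
    keywords_matrix3 <- TermDocumentMatrix(keywords_doc, control = list(tokenize = TrigramTokenizer))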

  • 2020-11-28 05:06

    This is my own creation, made for a different purpose, but I think it may be applicable to your needs too:

    #User Defined Functions
    Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
    
    breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE))
    
    strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
        strp <- function(x, digit.remove, apostrophe.remove){
            x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
            x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
            ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
        }
        unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
            apostrophe.remove = apostrophe.remove))))
    }
    
    unblanker <- function(x)subset(x, nchar(x)>0)
    
    #Fake Text Data
    x <- "I like green eggs and ham.  They are delicious.  They taste so yummy.  I'm talking about ham and eggs of course"
    
    #The code using Base R to Do what you want
    breaker(x)
    strip(x)
    words <- unblanker(breaker(strip(x)))
    textDF <- as.data.frame(table(words))
    textDF$characters <- sapply(as.character(textDF$words), nchar)
    textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
    rownames(textDF2) <- 1:nrow(textDF2)
    textDF2
    subset(textDF2, characters%in%2:3)
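
    Note that the last line keeps words that are 2-3 characters long. If what you want are 2-3 word phrases built from the same words vector, a minimal base-R sketch is to paste adjacent words together (bigrams and trigrams are just illustrative names):

    bigrams  <- paste(head(words, -1), tail(words, -1))
    trigrams <- paste(head(words, -2), head(tail(words, -1), -1), tail(words, -2))
    head(sort(table(c(bigrams, trigrams)), decreasing = TRUE), 10)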
    
  • 2020-11-28 05:11

    You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.

    library(tm); library(tau);
    
    tokenize_ngrams <- function(x, n = 3)
      return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
    
    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    corpus <- Corpus(VectorSource(texts))
    matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))
    

    Where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which further simplifies things.

    library(RTextTools)
    texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
    matrix <- create_matrix(texts,ngramLength=3)
    

    This returns a class of DocumentTermMatrix for use with package tm.
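
    To list the most frequent phrases from either matrix, summing the columns of the DocumentTermMatrix is enough. A minimal sketch (phrase_freq is just an illustrative name):

    phrase_freq <- sort(colSums(as.matrix(matrix)), decreasing = TRUE)  # total count of each phrase
    head(phrase_freq, 10)                                               # the 10 most frequent phrases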

  • 2020-11-28 05:18

    The corpus package has a function called term_stats that does what you want:

    library(corpus)
    corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
    text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
    term_stats(corpus, ngrams = 2:3)
    ##    term             count support
    ## 1  of the             336       1
    ## 2  the scarecrow      208       1
    ## 3  to the             185       1
    ## 4  and the            166       1
    ## 5  said the           152       1
    ## 6  in the             147       1
    ## 7  the lion           141       1
    ## 8  the tin            123       1
    ## 9  the tin woodman    114       1
    ## 10 tin woodman        114       1
    ## 11 i am                84       1
    ## 12 it was              69       1
    ## 13 in a                64       1
    ## 14 the great           63       1
    ## 15 the wicked          61       1
    ## 16 wicked witch        60       1
    ## 17 at the              59       1
    ## 18 the little          59       1
    ## 19 the wicked witch    58       1
    ## 20 back to             57       1
    ## ⋮  (52511 rows total)
    

    Here, count is the number of appearances, and support is the number of documents containing the term.
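
    Since the result prints like an ordinary data frame sorted by count, you can also subset it with base R, e.g. to keep only the three-word phrases (a sketch, assuming the returned object behaves like a data frame with a term column):

    stats <- term_stats(corpus, ngrams = 2:3)
    head(stats[lengths(strsplit(stats$term, " ")) == 3, ], 10)  # ten most frequent 3-word phrases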

  • 2020-11-28 05:18

    I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw that documents with fewer than 2 words failed with the following error:

       input 'x' has nwords=1 and n=2; must have nwords >= n
    

    So I added a filter to remove documents with a low word count:

        myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
          length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
        })
    

    Then my tokenizer function looks like this:

    bigramTokenizer <- function(x) {
      x <- as.character(x)
    
      # Find words
      one.list <- c()
      tryCatch({
        one.gram <- ngram::ngram(x, n = 1)
        one.list <- ngram::get.ngrams(one.gram)
      }, 
      error = function(cond) { warning(cond) })
    
      # Find 2-grams
      two.list <- c()
      tryCatch({
        two.gram <- ngram::ngram(x, n = 2)
        two.list <- ngram::get.ngrams(two.gram)
      },
      error = function(cond) { warning(cond) })
    
      res <- unlist(c(one.list, two.list))
      res[res != '']
    }
    

    Then you can test the function with:

    dtmTest <- lapply(myCorpus.3, bigramTokenizer)
    

    And finally:

    dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
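
    To finish with the most frequent phrases, sum the columns of the resulting matrix (a minimal sketch; freq is just an illustrative name):

    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # phrase counts over the whole corpus
    head(freq, 20)                                            # the 20 most frequent 1- and 2-grams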
    