Form bigrams without stopwords in R

Submitted by 末鹿安然 on 2019-12-24 01:59:15

Question


I have recently had some trouble with bigrams while text mining in R. The goal is to find meaningful keywords in news articles, for example "smart car" and "data mining".

Suppose I have the following string:

"IBM have a great success in the computer industry for the past decades..."

After removing the stopwords ("have", "a", "in", "the", "for"), it becomes:

"IBM great success computer industry past decades..."

As a result, bigrams like "success computer" or "industry past" will occur.

What I actually need are bigrams in which no stopword stood between the two words in the original text; "computer industry" is a clear example of the kind of bigram I want.

The relevant part of my code is below:

library(tm)  # attaches NLP, which provides ngrams() and words()
# `corpus` is assumed to be an existing tm corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
# Tokenizer that returns space-separated bigrams
NgramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))

Is there any way to avoid getting results like "success computer" when counting term frequencies?


Answer 1:


Note: Edited 2017-10-12 to reflect new quanteda syntax.

You can do this in quanteda, which can remove stop words from ngrams after they have been formed.

txt <- "IBM have a great success in the computer industry for the past decades..."

library("quanteda")
library("magrittr")  # provides %>% in case the installed quanteda version does not re-export it
myDfm <- tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = 2) %>%
    dfm()

featnames(myDfm)
# [1] "great_success"     "computer_industry" "past_decades" 

What it does:

  1. Forms tokens.
  2. Removes punctuation using the regular expression \p{P}, but leaves an empty pad (padding = TRUE) wherever a token was removed. This ensures that you will not form ngrams from tokens that were never adjacent to begin with because they were separated by punctuation.
  3. Removes the stopwords, also leaving pads in their place.
  4. Forms the bigrams.
  5. Constructs the document-feature matrix.
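
To see why the padding matters, the same idea can be sketched in base R without quanteda. This is only an illustration under simplifying assumptions: the helper names make_bigrams and padded_bigrams are hypothetical, tokenization is a plain split on spaces, and the stopword list is just the five words named in the question rather than stopwords("english").

```r
# Hypothetical helper: pair each token with its right-hand neighbour.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}

# Hypothetical helper: replace stopwords with empty pads, form bigrams,
# then drop every bigram that touches a pad.
padded_bigrams <- function(tokens, stops) {
  if (length(tokens) < 2) return(character(0))
  padded <- ifelse(tokens %in% stops, "", tokens)
  left   <- head(padded, -1)
  right  <- tail(padded, -1)
  keep   <- left != "" & right != ""
  paste(left[keep], right[keep], sep = "_")
}

txt    <- "IBM have a great success in the computer industry for the past decades"
tokens <- strsplit(tolower(txt), " ", fixed = TRUE)[[1]]
stops  <- c("have", "a", "in", "the", "for")  # assumption: only the words named in the question

naive  <- make_bigrams(tokens[!tokens %in% stops])  # stopwords deleted before pairing
padded <- padded_bigrams(tokens, stops)             # stopwords kept as pads while pairing

print(naive)
print(padded)
```

Here naive still contains pairs such as "success_computer", because deleting the stopwords first makes non-adjacent words look adjacent. padded keeps only great_success, computer_industry and past_decades, which matches the feature names from the quanteda pipeline above.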

To get a count of these bigrams, you can either inspect the dfm directly, or use topfeatures():

myDfm
# Document-feature matrix of: 1 document, 3 features.
# 1 x 3 sparse Matrix of class "dfmSparse"
#        features
# docs    great_success computer_industry past_decades
#   text1             1                 1            1

topfeatures(myDfm)
#    great_success computer_industry      past_decades 
#                1                 1                 1 


Source: https://stackoverflow.com/questions/34282370/form-bigrams-without-stopwords-in-r
