Question
I have recently run into trouble with bigrams while text mining in R. The goal is to find meaningful keywords in news text, for example "smart car" and "data mining".
Say I have the following string:
"IBM have a great success in the computer industry for the past decades..."
After removing the stopwords ("have", "a", "in", "the", "for"), the string becomes:
"IBM great success computer industry past decades..."
As a result, bigrams like "success computer" or "industry past" will occur.
What I actually need are bigrams whose two words had no stopword between them in the original text; "computer industry" is a clear example of the kind of bigram I want.
Part of my code is below:
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
NgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))
Is there a way to exclude bigrams like "success computer" when counting term frequencies?
Answer 1:
Note: Edited 2017-10-12 to reflect new quanteda syntax.
You can do this in quanteda, which can remove stop words from ngrams after they have been formed.
txt <- "IBM have a great success in the computer industry for the past decades..."
library("quanteda")
myDfm <- tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = 2) %>%
    dfm()
featnames(myDfm)
# [1] "great_success" "computer_industry" "past_decades"
What it does:
- Forms tokens.
- Removes punctuation using a regular expression, but leaves empty pads where the punctuation was removed. This ensures that you will not form ngrams from tokens that were never adjacent to begin with because they were separated by punctuation.
- Removes the stopwords, also leaving pads in their place.
- Forms the bigrams.
- Constructs the document-feature matrix.
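To make the role of the padding concrete, here is a minimal base-R sketch of the same idea that does not use quanteda. It is only an illustration under stated assumptions: the stopword vector `stops` is a small hand-picked subset (the five words from the question), not the list returned by stopwords("english"), and the tokenizing regular expression is a simplification of what tokens() does.

```r
# Minimal base-R sketch of "pad first, then form bigrams".
# Assumption: `stops` is a tiny illustrative stopword list, not quanteda's.
txt <- "IBM have a great success in the computer industry for the past decades..."
stops <- c("have", "a", "in", "the", "for")

# Split into runs of alphanumerics (words) and runs of punctuation
toks <- regmatches(txt, gregexpr("[[:alnum:]]+|[[:punct:]]+", txt))[[1]]

# Replace punctuation and stopwords with an empty pad instead of dropping them,
# so the positions of the remaining words are preserved
is_pad <- grepl("^[[:punct:]]+$", toks) | tolower(toks) %in% stops
toks[is_pad] <- ""

# Pair every token with its right-hand neighbour and keep only the pairs
# in which neither side is a pad
left  <- head(toks, -1)
right <- tail(toks, -1)
keep  <- left != "" & right != ""
bigrams <- paste(left[keep], right[keep], sep = "_")

bigrams
# [1] "great_success" "computer_industry" "past_decades"
```

Because the pads stay in place, "success" and "computer" are never neighbours, so "success_computer" is not formed.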
To get a count of these bigrams, you can either inspect the dfm directly or use topfeatures():
myDfm
# Document-feature matrix of: 1 document, 3 features.
# 1 x 3 sparse Matrix of class "dfmSparse"
#        features
# docs    great_success computer_industry past_decades
#   text1             1                 1            1
topfeatures(myDfm)
#     great_success computer_industry      past_decades
#                 1                 1                 1
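If you only need rough counts across a handful of documents and would rather avoid a package dependency, the same padded-bigram idea can be combined with base R's table(). This is a sketch, not part of quanteda: the helper `padded_bigrams()`, the example `docs`, and the short `stops` vector are all hypothetical names and data made up for illustration.

```r
# Hypothetical helper: bigrams for one string, never spanning a stopword
# or punctuation mark (base R only).
padded_bigrams <- function(txt, stops) {
  toks <- regmatches(txt, gregexpr("[[:alnum:]]+|[[:punct:]]+", txt))[[1]]
  toks <- tolower(toks)
  # Replace punctuation and stopwords with an empty pad
  toks[grepl("^[[:punct:]]+$", toks) | toks %in% stops] <- ""
  if (length(toks) < 2) return(character(0))
  left  <- head(toks, -1)
  right <- tail(toks, -1)
  keep  <- left != "" & right != ""
  paste(left[keep], right[keep], sep = "_")
}

# Made-up example documents and stopword list
docs  <- c("The computer industry is growing.",
           "IBM leads the computer industry in data mining.")
stops <- c("the", "is", "in")

# Pool the bigrams from all documents and count them
all_bigrams <- unlist(lapply(docs, padded_bigrams, stops = stops))
counts <- sort(table(all_bigrams), decreasing = TRUE)
counts
```

Here "computer_industry" is counted twice, once per document, while "industry_data" is never formed because "in" sat between the two words.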
Source: https://stackoverflow.com/questions/34282370/form-bigrams-without-stopwords-in-r