Question
I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set from last summer.
Now, I am trying to use it this summer to categorize new text we get here at work. I tried doing this and got the following error:
Error in predict.textmodel_NB_fitted(model, test_dfm) :
feature set in newdata different from that in training set
The code in the function that generates the error can be found here at lines 157 to 165.
I assume this occurs because the words in the training data set do not exactly match the words used in the test data set. But why does this error occur? I feel as if—to be useful in real-world examples—the model should be able to handle data sets that contain different features, as this is what will probably always happen in applied use.
So my first question is:
1. Is this error a property of the naive Bayes algorithm? Or was it a choice made by the author of the function to do this?
Which then leads me to my second question:
2. How can I remedy this issue?
To get at this second question, I provide reproducible code (the last line generates the error above):
library(quanteda)
library(magrittr)
library(data.table)
train_text <- c("Can random effects apply only to categorical variables?",
"ANOVA expectation identity",
"Statistical test for significance in ranking positions",
"Is Fisher Sharp Null Hypothesis testable?",
"List major reasons for different results from survival analysis among different studies",
"How do the tenses and aspects in English correspond temporally to one another?",
"Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
"Are collective nouns always plural, or are certain ones singular?",
"What’s the rule for using “who” and “whom” correctly?",
"When is a gerund supposed to be preceded by a possessive adjective/determiner?")
train_class <- factor(c(rep(0,5), rep(1,5)))
train_dfm <- train_text %>%
dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
model <- textmodel_NB(train_dfm, train_class)
test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
"What do significance tests for adjusted means tell us?",
"How should I punctuate around quotes?",
"Should I put a comma before the last item in a list?")
test_dfm <- test_text %>%
dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))
predict(model, test_dfm)
The only thing I could think of was to manually make the features the same (I assumed this would fill in 0 for features that are not present in the object), but this generated a new error. The code for the example above is:
model_features <- model$data$x@Dimnames$features # gets the features of the training data
test_features <- test_dfm@Dimnames$features # gets the features of the test data
all_features <- c(model_features, test_features) %>% # combining the two sets of features...
subset(!duplicated(.)) # ...and getting rid of duplicate features
model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features
predict(model, dfm) # new error?
However, this code generates a new error:
Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") :
argument is of length zero
How do I apply this naive Bayes model to a new data set with different features?
Answer 1:
Fortunately, there is an easy way to do this: use dfm_select() on your test data to give it the same features (in the same order) as the training set. It's this simple:
test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
##
## lp(0) lp(1) Pr(0) Pr(1) Predicted
## text1 -0.6931472 -0.6931472 0.5000 0.5000 0
## text2 -11.8698712 -13.1879095 0.7889 0.2111 0
## text3 -4.1484118 -3.6635616 0.3811 0.6189 1
## text4 -8.0091415 -8.4257356 0.6027 0.3973 0
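To make clear what "identical features and ordering" means, here is a minimal base R sketch of the alignment step on a plain count matrix. It is not quanteda's implementation: align_features is a hypothetical helper, and the vocabulary and counts are made up for illustration. Features unseen in training are dropped, and training features missing from the new data are filled with zeros.

```r
# Hypothetical helper (not part of quanteda): align the columns of a
# document-feature count matrix to a reference vocabulary.
align_features <- function(counts, vocab) {
  # Start from an all-zero matrix with one column per training feature.
  aligned <- matrix(0, nrow = nrow(counts), ncol = length(vocab),
                    dimnames = list(rownames(counts), vocab))
  # Copy the columns both matrices share. Features unseen in training
  # are dropped; training features absent from the new data stay at zero.
  shared <- intersect(colnames(counts), vocab)
  aligned[, shared] <- counts[, shared, drop = FALSE]
  aligned
}

# Made-up training vocabulary and new-document counts (assumptions).
train_vocab <- c("anova", "test", "signific", "noun", "pronoun")
new_counts <- matrix(c(1, 0,
                       1, 0,
                       0, 2),
                     nrow = 2,
                     dimnames = list(c("text1", "text2"),
                                     c("test", "signific", "comma")))

aligned <- align_features(new_counts, train_vocab)
print(aligned)
```

Note that text2 ends up with all-zero counts here, which is the same situation as text1 in the output above: with no known features, the prediction falls back to the class priors.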
Answer 2:
As of May 2018, predict() also appears to accept a force = TRUE option that does the same job:
predict(model, test_dfm, force = TRUE)
# text1 text2 text3 text4
# 0 0 1 0
# Levels: 0 1
Source: koheiw and kbenoit discussion on the quanteda Github - https://github.com/quanteda/quanteda/issues/1329
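On the first question, my understanding is that the restriction follows from how naive Bayes scores a document rather than being an arbitrary choice: the score is a product of the document's feature counts with per-class log likelihoods, so the columns have to line up, and a word never seen in training has no estimated likelihood to contribute. The hand-written toy below (not quanteda's code; all counts, priors, and names are assumptions) shows that matrix product.

```r
# Toy multinomial naive Bayes written out by hand (not quanteda's code).
# Rows are classes, columns are features; the counts are made up.
train_counts <- matrix(c(2, 1, 0,
                         0, 1, 3),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("class0", "class1"),
                                       c("test", "list", "comma")))

# Per-class word likelihoods with add-one smoothing.
smoothed <- train_counts + 1
word_lik <- smoothed / rowSums(smoothed)
log_prior <- log(c(class0 = 0.5, class1 = 0.5))

# New documents must use the same columns, in the same order.
new_counts <- matrix(c(1, 0, 0,
                       0, 0, 2),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("text1", "text2"),
                                     colnames(train_counts)))

# Unnormalized log posterior: counts times log likelihoods, plus log prior.
# The matrix product is only defined when the feature columns match.
log_post <- new_counts %*% t(log(word_lik))
log_post <- sweep(log_post, 2, log_prior, "+")

predicted <- colnames(log_post)[max.col(log_post, ties.method = "first")]
print(predicted)
```

If new_counts had an extra or missing column, the %*% step would fail or, worse, silently pair counts with the wrong words, which is presumably why the function stops with an error unless the features are matched first.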
Source: https://stackoverflow.com/questions/44136757/quanteda-package-naive-bayes-how-can-i-predict-on-different-featured-test-data