R Lime package for text data

Submitted by 放肆的年华 on 2019-12-06 10:11:32

Question


I was exploring the use of R lime on text datasets to explain black box model predictions and came across an example https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

I was testing on a restaurant review dataset but found that the plot produced by plot_features doesn't print all the features. I was wondering if anyone could provide advice/insights on why this happens, or recommend a different package to use. Help here is greatly appreciated since not much work on R lime can be found online. Thanks!

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

library(caret)
model <- train(Liked ~ ., data = training_set, method = "xgbTree")

######
#LIME#
######
library(lime)
explainer <- lime(training_set, model)
explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)

My undesired output: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0

What I want (different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0


Answer 1:


I could not open the links you provided for the dataset and output. However, I am using the same vignette you linked, https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html . I use text2vec, as in the vignette, and the xgboost package for classification, and it works for me. To display more features, you may need to increase the value of n_features in the explain function (see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain and the variation after the code below).

library(lime)
library(xgboost)  # the classifier
library(text2vec) # used to build the BoW matrix

# load data
data(train_sentences, package = "lime")  # from lime 
data(test_sentences, package = "lime")   # from lime

# Tokenize data
get_matrix <- function(text) {
  it <- text2vec::itoken(text, progressbar = FALSE)

  # use the following lines instead if you want to prune the vocabulary
  # vocab <- create_vocabulary(it, ngram = c(1L, 1L)) %>%
  #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
  # vectorizer <- vocab_vectorizer(vocab)

  # hash_vectorizer has no option to prune the vocabulary, but it is very fast for big data
  vectorizer <- hash_vectorizer(hash_size = 2^10, ngram = c(1L, 1L))
  text2vec::create_dtm(it, vectorizer = vectorizer)
}

# BoW matrix generation
# features should be the same for both dtm_train and dtm_test 
dtm_train <- get_matrix(train_sentences$text)
dtm_test  <- get_matrix(test_sentences$text) 

# xgboost for classification
param <- list(max_depth = 7,
              eta = 0.1,
              objective = "binary:logistic",
              eval_metric = "error",
              nthread = 1)

xgb_model <- xgboost::xgb.train(
  param,
  xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
  nrounds = 100
)

# prediction
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"

# Accuracy
print(mean(predictions == test_labels))

# what are the most important words for the predictions.
n_features <- 5 # number of features to display
sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
explainer <- lime::lime(sentence_to_explain, model = xgb_model, 
                    preprocess = get_matrix)
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1, 
                         n_features = n_features)

# inspect the explanation table (columns 2 to 9)
explanation[, 2:9]

# plot
lime::plot_features(explanation)
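
If plot_features still shows fewer bars than you expect, try a larger n_features, as mentioned above; for text, lime cannot show more features than the influential tokens it finds in a given sentence. A minimal variation of the call above:

# same call as above, just asking lime for more features per case
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                             n_features = 10)
lime::plot_features(explanation)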

In your code, NAs are created by the following line when it is applied to the train_sentences dataset. Please check this line in your code:

dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

Removing levels or changing levels to labels works for me.
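
For example, either of the following avoids the NAs (a minimal sketch against the question's own dataset, assuming dataset$Liked holds 0/1 values):

# Option 1: drop the explicit levels and let factor() infer them from the data
dataset$Liked = factor(dataset$Liked)

# Option 2: keep explicit levels but map them through labels
dataset$Liked = factor(dataset$Liked, levels = c(0, 1), labels = c("0", "1"))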

Please check your data structure and make sure your data is not a zero matrix because of those NAs, and that it is not too sparse. Either can also cause the problem, since lime then cannot find the top n features.
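
A quick sanity check along those lines (a sketch; dataset and dtm are the objects from the question's code):

# count NAs introduced by the factor conversion
sum(is.na(dataset$Liked))

# fraction of zero entries in the document-term matrix (sparsity)
m <- as.matrix(dtm)
mean(m == 0)

# make sure each document still has some non-zero term counts
summary(rowSums(m))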



Source: https://stackoverflow.com/questions/50752133/r-lime-package-for-text-data
