问题
I am doing texting mining using natural language processing. I used quanteda
package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-square test.
I know there were already a lot of people asked this question. However, I couldn't find the relevant code for that. (The answers just gave a brief concept, like this: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)
I learned that I could use chi.squared
in FSelector
package but I don't know how to apply this function to a dfm class object (trainingtfidf
below). (Shows in the manual, it applies to the predictor variable)
Could anyone give me a hint? I appreciate it!
Example code:
description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)
library(quanteda)
trainingcorpus <- corpus(example$description)
trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem=TRUE, toLower=TRUE, removePunct= TRUE, removeSeparators=TRUE, language="english", ignoredFeatures = stopwords("english"), removeNumbers=TRUE, ngrams = 2)
# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
回答1:
Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.
Note that I am showing how to do this in the quanteda package, but the results should be general enough to work for other text package matrix objects. Here, I am using the data from the auxiliary quantedaData package that has all of the State of the Union addresses of US presidents.
data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
## Democratic Democratic-Republican Federalist Independent
## 90 28 4 8
## Republican Whig
## 9 8
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) {
chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})
head(sort(chi2vals, decreasing = TRUE), 10)
## government will united states year public congress upon
## 85.19783 74.55845 68.62642 66.57434 64.30859 63.19322 59.49949 57.83603
## war people
## 57.43142 57.38697
These can now be selected using the dfm_select()
command. (Note that column indexing by name would also work.)
# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
## features
## docs citizens government upon duties constitution present
## Jackson-1830 14 68 67 12 17 23
## Jackson-1831 21 26 13 7 5 22
## Jackson-1832 17 36 23 11 11 18
## Jackson-1829 17 58 37 16 7 17
## Jackson-1833 14 43 27 18 1 17
## Jackson-1834 24 74 67 11 11 29
Added: With >= v0.9.9 this can be done using the textstat_keyness()
function.
# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
corpus_subset(party %in% c("Democratic", "Republican")) %>%
dfm(remove = stopwords("english"))
chi2vals <- dfm_group(sotuDfm, "party") %>%
textstat_keyness(measure = "chi2")
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 5 million 132.3267 0 366 131
# 6 texas 101.1991 0 174 37
This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.
# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 29044 commission 190.3010 0 175 588
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 29043 law 137.8330 0 607 1178
dfmTop100cs <- dfm_select(sotuDfm, chi2vals$feature)
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
features
docs fellow citizens senate house representatives :
Jackson-1829 5 17 2 3 5 1
Jackson-1830 6 14 4 6 9 3
Jackson-1831 9 21 3 1 4 1
Jackson-1832 6 17 4 1 2 1
Jackson-1833 2 14 7 4 6 1
Jackson-1834 3 24 5 1 3 5
来源:https://stackoverflow.com/questions/38538821/feature-selection-in-document-feature-matrix-by-using-chi-squared-test