Question
I'm using the R tm package, and I find that almost none of the tm_map
functions that remove elements of text are working for me.
By 'working' I mean, for example, that I'll run:
d <- tm_map(d, removeWords, stopwords('english'))
but then when I run
ddtm <- DocumentTermMatrix(d, control = list(
    weighting = weightTfIdf,
    minWordLength = 2))
findFreqTerms(ddtm, 10)
I still get:
[1] the this
...etc., and a bunch of other stopwords.
I see no error indicating that something has gone wrong. Does anyone know what this is, how to make stopword removal work correctly, or how to diagnose what's going wrong for me?
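A quick way to diagnose this kind of problem (a minimal sketch using a toy corpus, not my actual data) is to inspect a document's text directly before and after the transformation, rather than going through DocumentTermMatrix:

```r
library(tm)

# Toy one-document corpus; a real diagnosis would use your own corpus d.
d <- Corpus(VectorSource("This is the text of the first document."))

as.character(d[[1]])                                # before removal
d2 <- tm_map(d, removeWords, stopwords("english"))
as.character(d2[[1]])                               # after: "the", "is", "of" gone
```

If the lowercase stopwords are still visible here, the problem is in the transformation itself; if they are gone, the problem is downstream in how the DTM is built.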
UPDATE
There is an error earlier up that I didn't catch:
Refreshing GOE props...
---Registering Weka Editors---
Trying to add database driver (JDBC): RmiJdbc.RJDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): jdbc.idbDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): com.mckoi.JDBCDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.hsqldb.jdbcDriver - Warning, not in CLASSPATH?
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...
It is Weka that removes stopwords in tm, right? So could this be my problem?
Update 2
From this, the error appears to be unrelated: it's about the database drivers, not about stopwords.
Answer 1:
Never mind, it is working. I ran the following minimal example:
data("crude")
crude[[1]]
j <- Corpus(VectorSource(crude[[1]]))
jj <- tm_map(j, removeWords, stopwords('english'))
jj[[1]]
I had chained several tm_map calls in series. It turned out that the order in which I removed spaces, punctuation, etc. was concatenating words together and producing new stopword tokens.
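One concrete way the ordering can bite (a hedged sketch, not necessarily the exact failure in my corpus): if removePunctuation runs before removeWords, a contraction like "don't" is mangled into "dont", which no longer matches anything in stopwords('english') and therefore survives:

```r
library(tm)

docs <- Corpus(VectorSource("I don't think the answer is obvious."))

# Punctuation first: "don't" becomes "dont", which is not a listed stopword.
bad <- tm_map(docs, removePunctuation)
bad <- tm_map(bad, removeWords, stopwords("english"))
as.character(bad[[1]])   # "dont" is still there

# Stopwords first: "don't" matches the list while the apostrophe is intact.
good <- tm_map(docs, removeWords, stopwords("english"))
good <- tm_map(good, removePunctuation)
good <- tm_map(good, stripWhitespace)
as.character(good[[1]])  # "dont" is gone
```

The general lesson is to remove stopwords while the text still matches the stopword list's spelling, and only then strip punctuation and collapse whitespace.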
Source: https://stackoverflow.com/questions/14757489/r-tm-removewords-stopwords-is-not-removing-stopwords