Question
I'm using the R tm package, and I find that almost none of the tm_map
functions that remove elements of text are working for me.
By 'working' I mean, for example, that I'll run:
d <- tm_map(d, removeWords, stopwords('english'))
but then when I run
ddtm <- DocumentTermMatrix(d, control = list(
    weighting = weightTfIdf,
    minWordLength = 2))
findFreqTerms(ddtm, 10)
I still get:
[1] the this
...etc., and a bunch of other stopwords.
I see no error indicating that something has gone wrong. Does anyone know what this is, how to make stopword removal work correctly, or how to diagnose what's going wrong for me?
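A quick way to diagnose this kind of problem (a minimal sketch using a toy corpus, not my actual data) is to inspect a document's text directly before and after the transformation, rather than going through DocumentTermMatrix:

```r
library(tm)

# Toy one-document corpus; a real diagnosis would use your own corpus d.
d <- Corpus(VectorSource("This is the text of the first document."))

as.character(d[[1]])                                # before removal
d2 <- tm_map(d, removeWords, stopwords("english"))
as.character(d2[[1]])                               # after: "the", "is", "of" gone
```

If the lowercase stopwords are still visible here, the problem is in the transformation itself; if they are gone, the problem is downstream in how the DTM is built.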
UPDATE
There is an error earlier up that I didn't catch:
Refreshing GOE props...
---Registering Weka Editors---
Trying to add database driver (JDBC): RmiJdbc.RJDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): jdbc.idbDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): com.mckoi.JDBCDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.hsqldb.jdbcDriver - Warning, not in CLASSPATH?
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...
It is Weka that removes stopwords in tm, right? So could this be my problem?
Update 2
From this, the error appears to be unrelated: it's about the database drivers, not about stopwords.
Answer 1:
Never mind, it is working. I ran the following minimal example:
data("crude")
crude[[1]]
j <- Corpus(VectorSource(crude[[1]]))
jj <- tm_map(j, removeWords, stopwords('english'))
jj[[1]]
I had chained several tm_map calls in series. It turned out that the order in which I removed spaces, punctuation, etc. was concatenating words together and producing new stopword tokens.
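One concrete way the ordering can bite (a hedged sketch, not necessarily the exact failure in my corpus): if removePunctuation runs before removeWords, a contraction like "don't" is mangled into "dont", which no longer matches anything in stopwords('english') and therefore survives:

```r
library(tm)

docs <- Corpus(VectorSource("I don't think the answer is obvious."))

# Punctuation first: "don't" becomes "dont", which is not a listed stopword.
bad <- tm_map(docs, removePunctuation)
bad <- tm_map(bad, removeWords, stopwords("english"))
as.character(bad[[1]])   # "dont" is still there

# Stopwords first: "don't" matches the list while the apostrophe is intact.
good <- tm_map(docs, removeWords, stopwords("english"))
good <- tm_map(good, removePunctuation)
good <- tm_map(good, stripWhitespace)
as.character(good[[1]])  # "dont" is gone
```

The general lesson is to remove stopwords while the text still matches the stopword list's spelling, and only then strip punctuation and collapse whitespace.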
Source: https://stackoverflow.com/questions/14757489/r-tm-removewords-stopwords-is-not-removing-stopwords