
stemDocument in tm package not working on past-tense words

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:37:37
I have a file 'check_text.txt' that contains " said say says make made ". I'd like to stem it to get "say say say make make". I tried stemDocument from the tm package, as follows, but I only get "said say say make made". Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

    filename = 'check_text.txt'
    con <- file(filename, "rb")
    text_data <- readLines(con, skipNul = TRUE)
    close(con)
    text_VS <- VectorSource(text_data)
    text_corpus <- VCorpus(text_VS)
    text_corpus <- tm_map(text_corpus, stemDocument,
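
The result reported in the question is what a rule-based suffix stripper would be expected to give: irregular forms such as "said" and "made" have no regular ending to remove, so they usually need a dictionary lookup (lemmatization) instead. The following is a minimal Python sketch of that idea, not the tm package's method. The names IRREGULAR, toy_stem and normalize are hypothetical, the lookup table covers only the two irregular words from the question, and toy_stem is a crude stand-in for a real stemmer.

```python
# Toy illustration only: IRREGULAR, toy_stem and normalize are hypothetical
# names and are not part of tm, SnowballC, or any other library.

# Hand-written lookup for irregular forms (assumption: covers only the
# two irregular words that appear in the question).
IRREGULAR = {"said": "say", "made": "make"}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    # Drop a plural/third-person "s", but leave short words and "ss" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Look up irregular forms first, then fall back to suffix stripping."""
    return " ".join(IRREGULAR.get(w, toy_stem(w)) for w in text.split())


result = normalize(" said say says make made ")
print(result)
```

In practice the lookup table would come from a lemmatization dictionary rather than being written by hand; the sketch only shows where the lookup sits relative to the stemming step.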

Is there a Java implementation of the Porter2 stemmer?

痞子三分冷 submitted on 2019-11-27 21:31:30
Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here: but the author mentions that Porter is a bit outdated and recommends using Porter2, available at However, my problem is that this Porter2 is written in Snowball (I had never heard of it before, so I don't know anything about it). What I am exactly looking for is a Java

Stemming algorithm that produces real words

纵然是瞬间 submitted on 2019-11-27 16:53:52
I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities. I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way): This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun". I've tried "Snowball" (suggested within another Stack Overflow thread). For my example
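
One common way to get "real" words out of a stemmer is to use the stem only as a grouping key and then display one of the original surface forms for each group. The following is a minimal Python sketch of that idea under stated assumptions: toy_stem is a hypothetical, deliberately crude stand-in for a Porter-style stemmer (it is not the Porter algorithm), and tags_from_words is a hypothetical helper that picks the most frequent original form per stem, breaking ties by shortest length and then alphabetically.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; handles only a few endings.

    Like Porter-style stemmers, it may return a non-word key.
    """
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("y"):
        return word[:-1] + "i"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tags_from_words(words):
    """Group words by stem, then represent each group by a real word."""
    forms = defaultdict(Counter)
    for word in words:
        lowered = word.lower()
        forms[toy_stem(lowered)][lowered] += 1

    tags = []
    for counter in forms.values():
        # Most frequent form wins; ties go to the shortest, then alphabetical.
        best = min(counter, key=lambda w: (-counter[w], len(w), w))
        tags.append(best)
    return sorted(tags)


tags = tags_from_words(["Community", "Communities", "community", "tags", "tag"])
print(tags)
```

The stem itself is never shown to the user, so it does not matter that the grouping key is not a dictionary word; swapping toy_stem for a real stemmer should leave the rest of the sketch unchanged.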

Stemming algorithm that produces real words

醉酒当歌 submitted on 2019-11-27 04:10:26
I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities. I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way): This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun". I've tried "Snowball"