Bigrams instead of single words in term-document matrix using R and RWeka


I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution has been posted on Stack Overflow here: findAssocs for multiple terms in


2 Answers

独厮守ぢ

2020-11-30 04:42
Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel package uses by default (set it before you call NGramTokenizer):
```
# Sets the default number of threads to use
options(mc.cores=1)
```
Since NGramTokenizer seems to hang on the parallel::mclapply call, limiting the number of threads to one seems to work around it.
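As a minimal sketch of how the setting could be applied only for the duration of the tokenizing step and then restored: `parallel::mclapply` takes its default worker count from the `mc.cores` option, so setting it to 1 keeps everything in a single process. The names `old_opts` and `cores_in_use` are made up for illustration.

```r
# Remember the previous value so it can be restored afterwards.
old_opts <- options(mc.cores = 1)

# parallel::mclapply() reads its default worker count from this option.
cores_in_use <- getOption("mc.cores")
print(cores_in_use)

# ... build the term-document matrix with the bigram tokenizer here ...

# Restore whatever was set before (the option is removed if it was unset).
options(old_opts)
```

Restoring the option is optional; it just avoids slowing down unrelated parallel code later in the same session.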

星月不相逢

2020-11-30 04:48
There seem to be problems using RWeka together with the parallel package. I found a workaround here.

The most important point is not to load the RWeka package with library(), and instead to call it through its namespace (RWeka::) inside a wrapper function.

So your tokenizer should look like this:
```
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
```
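For context, here is a rough sketch of how such a tokenizer might be plugged into a term-document matrix with tm. The sample documents and the names `docs`, `corpus` and `tdm` are invented for illustration. It assumes tm and RWeka (with a working Java setup) are installed, and it uses `VCorpus` because, as far as I know, tm's simpler corpus type does not honor custom tokenizers.

```r
# RWeka is deliberately NOT attached with library(); it is only reached via ::
library(tm)

# The single-core workaround from the other answer; harmless to keep.
options(mc.cores = 1)

BigramTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

# Made-up sample documents
docs <- c("the quick brown fox", "the quick red fox")
corpus <- VCorpus(VectorSource(docs))

# Pass the wrapper as the tokenizer so that terms are two-word sequences.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm)
```

If the matrix still contains single words, it may be worth checking that the corpus is a `VCorpus` and that the tokenizer is passed under the `tokenize` entry of `control`.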