Preparing word embeddings in text2vec R package


No, you do not need to concatenate reviews. You just need to construct the tcm from a correct iterator over tokens:

library(text2vec)
data("movie_review")
# lowercase the reviews and split them into word tokens
tokens = movie_review$review %>% tolower %>% word_tokenizer
it = itoken(tokens)
# create vocabulary and drop rare terms
v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 5)
# create co-occurrence vectorizer (we only need the tcm, not a dtm)
vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5)
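
If you want to sanity-check the pruned vocabulary before building the co-occurrence matrix, you can inspect it directly. A minimal sketch; note that the internal structure of the vocabulary object (the $vocab field below) differs between text2vec versions, so treat that field name as an assumption:

# print a summary of the pruned vocabulary
print(v)
# in the 0.3 API the terms and their counts are stored in a data.table;
# the $vocab field name is an assumption and may differ in other versions
head(v$vocab)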

Now we need to reinitialise the iterator, because building the vocabulary consumed it (this applies to the stable 0.3 version; the dev 0.4 version does not require reinitialising the iterator):

it = itoken(tokens)
tcm = create_tcm(it, vectorizer)
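
At this point you can check that the co-occurrence matrix has the expected shape. A minimal sketch, assuming create_tcm returns a sparse matrix from the Matrix package whose rows and columns correspond to vocabulary terms:

# dimensions should match the number of terms kept after pruning
dim(tcm)
# peek at a small corner of the sparse term-co-occurrence matrix
tcm[1:5, 1:5]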

Fit the GloVe model:

# fit GloVe with 50-dimensional word vectors on the co-occurrence matrix
fit <- glove(tcm = tcm,
             word_vectors_size = 50,
             x_max = 10, learning_rate = 0.2,
             num_iters = 15)
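
Once the model is fitted, a common next step is to query the resulting word vectors, for example to find nearest neighbours by cosine similarity. How you extract the word-vector matrix from the fitted object depends on the text2vec version, so the sketch below assumes you already have word_vectors as a terms-by-dimensions matrix with terms as rownames; the similarity computation itself is plain base R:

# assumption: word_vectors is a matrix with one row per vocabulary term,
# rownames(word_vectors) holding the terms
# normalise each row to unit length
norms = sqrt(rowSums(word_vectors ^ 2))
word_vectors_norm = word_vectors / norms
# cosine similarity of every term to a query term
# ("good" is just an example and must be present in the vocabulary)
query = word_vectors_norm["good", , drop = FALSE]
similarities = word_vectors_norm %*% t(query)
# top 10 most similar terms
head(sort(similarities[, 1], decreasing = TRUE), 10)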