text2vec

Building LDAvis plots using phrase tokens instead of single word tokens

Submitted by 孤街浪徒 on 2021-01-29 07:47:36
Question: My question is very simple: how can one build LDAvis's frequentist topic-modeling plots with phrase tokens instead of single-word tokens using the text2vec package in R? Currently the word tokenizer tokens = word_tokenizer(tokens) works great, but is there a phrase or n-gram tokenizer that would allow building LDAvis topic models and the corresponding plots with phrases instead of words? If not, how might such code be constructed? Is this even methodologically sound or advisable?
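
text2vec has no separate phrase tokenizer, but create_vocabulary() accepts an ngram argument, so ordinary word tokens can be turned into an n-gram vocabulary that feeds the usual LDA/LDAvis pipeline. A minimal sketch of that idea, assuming docs is a character vector of documents (the variable name is illustrative, not from the question):

library(text2vec)
tokens = word_tokenizer(tolower(docs))
it = itoken(tokens, progressbar = FALSE)
# keep unigrams and add bigrams; multi-word terms are joined with "_"
v = create_vocabulary(it, ngram = c(1L, 2L))
v = prune_vocabulary(v, term_count_min = 10)
dtm = create_dtm(it, vocab_vectorizer(v), type = "dgTMatrix")
lda = LDA$new(n_topics = 10L)
doc_topic = lda$fit_transform(dtm, n_iter = 1000)
lda$plot()  # the LDAvis panel now lists n-gram terms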

Get LDAvis json from text2vec

Submitted by 孤人 on 2021-01-01 13:33:11
Question: Given a document-term matrix dtm, text2vec provides a nice integration with the LDAvis package. However, I want to embed this visualisation in a markdown document. The LDAvis package has functions such as createJSON that would allow me to do this, but these are all hidden inside a private method in text2vec.
n_topics = 6
lda = LDA$new(n_topics = 6L, doc_topic_prior = 50 / n_topics, topic_word_prior = 1 / n_topics)
doc_topic_distr = lda$fit_transform(dtm, n_iter = 1000, convergence_tol = 1e
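
One way around the private plotting method is to build the JSON yourself with LDAvis::createJSON() from the model's public pieces and then embed or serve it. A rough sketch, assuming dtm, lda and doc_topic_distr from the snippet above; depending on the text2vec version you may need to renormalise the rows of doc_topic_distr so they sum to 1:

library(LDAvis)
library(Matrix)
json = createJSON(
  phi = lda$topic_word_distribution,   # topics x terms, rows sum to 1
  theta = doc_topic_distr,             # documents x topics
  doc.length = Matrix::rowSums(dtm),
  vocab = colnames(dtm),
  term.frequency = Matrix::colSums(dtm)
)
# the json string can now be written to disk or passed to serVis()/renderVis()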

Lemmatization using txt file with lemmas in R

Submitted by ↘锁芯ラ on 2020-01-13 06:42:25
Question: I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemma files for many other languages: http://www.lexiconista.com/datasets/lemmatization/):
Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
Abadan Abadanu
abadańczyk abadańczycy
abadańczyk abadańczyka
abadańczyk abadańczykach
abadańczyk abadańczykami
abadańczyk abadańczyki
abadańczyk abadańczykiem
abadańczyk abadańczykom
abadańczyk abadańczyków
abadańczyk abadańczykowi
abadańczyk abadańczyku
abadanka abadance
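
A common pattern for these two-column files (lemma, then inflected form) is to load them into a named character vector and replace each token by its lemma when a match exists. A minimal sketch, assuming the file is tab-separated and named lemmatization-pl.txt (both assumptions, not from the question):

# column 1 = lemma, column 2 = inflected form
lem = read.delim("lemmatization-pl.txt", header = FALSE,
                 stringsAsFactors = FALSE, encoding = "UTF-8")
dict = setNames(lem[[1]], lem[[2]])   # names: forms, values: lemmas

lemmatize = function(tokens) {
  hit = tokens %in% names(dict)
  tokens[hit] = dict[tokens[hit]]
  tokens
}

lemmatize(c("Abadanem", "abadańczykami", "kot"))
# -> "Abadan" "abadańczyk" "kot" (unknown tokens are left unchanged)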

LDA topic model using R text2vec package and LDAvis in shinyApp

Submitted by 对着背影说爱祢 on 2019-12-24 08:04:48
Question: Here is the code for LDA topic modelling with the R text2vec package:
library(text2vec)
tokens = docs$text %>%   # docs$text: a collection of text documents
  word_tokenizer
it = itoken(tokens, ids = docs$id, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model = text2vec::LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc
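
For the Shiny side, the LDAvis package provides visOutput() and renderVis(), so one option is to compute the JSON once (for example with LDAvis::createJSON() from the fitted model's distributions) and render it in the app. A minimal sketch, assuming a precomputed JSON string json (illustrative name, not from the question):

library(shiny)
library(LDAvis)

ui = fluidPage(
  visOutput("lda_vis")
)
server = function(input, output) {
  output$lda_vis = renderVis(json)   # json built beforehand, e.g. via createJSON()
}
shinyApp(ui, server)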

Preparing word embeddings in text2vec R package

Submitted by 放肆的年华 on 2019-12-10 11:25:54
Question: Based on the text2vec package's vignette, an example is provided for creating word embeddings. The wiki data is tokenized, a term co-occurrence matrix (TCM) is created, and that TCM is used to create the word embeddings with the GloVe function provided in the package. I want to build word embeddings for the movie review data provided with the package. My question is: do I need to collapse all the movie reviews into one long string and then do the tokenization? This will cause boundary tokens between 2
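
For what it's worth, collapsing the reviews should not be necessary: itoken() accepts a character vector with one element per review, and the co-occurrence window used by create_tcm() does not cross document boundaries. A minimal sketch on the bundled movie_review data, assuming the current GlobalVectors interface (the rank/x_max argument names may differ in older text2vec versions):

library(text2vec)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer, progressbar = FALSE)
v = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
tcm = create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5L)
glove = GlobalVectors$new(rank = 50, x_max = 10)
word_vectors = glove$fit_transform(tcm, n_iter = 10)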

A lemmatizing function using a hash dictionary does not work with tm package in R

Submitted by 佐手、 on 2019-12-09 23:45:13
Question: I would like to lemmatize Polish text using a large external dictionary (in the format shown in the txt variable below). Unfortunately, Polish is not an available option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts (I have also removed Polish diacritics from both the dictionary and the corpus). Unfortunately, it does not work with the corpus format generated by tm.
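
If the lemmatizer already works on a character vector, one way to use it inside a tm pipeline is to wrap it with content_transformer() so that tm_map() can apply it document by document. A minimal sketch, assuming a lemmatize() function that maps a character vector of tokens to their lemmas (as in the linked answer) and a character vector docs of texts; both names are assumptions:

library(tm)
# hypothetical wrapper: takes one document string, returns the lemmatized string
lemmatize_text = function(x) {
  tokens = unlist(strsplit(x, "\\s+"))
  paste(lemmatize(tokens), collapse = " ")
}
corpus = VCorpus(VectorSource(docs))
corpus = tm_map(corpus, content_transformer(lemmatize_text))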

How to get topic probability table from text2vec LDA

Submitted by 拟墨画扇 on 2019-12-08 04:13:22
Question: The LDA topic modeling in the text2vec package is amazing. It is indeed much faster than topicmodels. However, I don't know how to get the probability that each document belongs to each topic, as in the example below:
   V1          V2          V3          V4
1  0.001025237 7.89E-05    7.89E-05    7.89E-05
2  0.002906977 0.002906977 0.014534884 0.002906977
3  0.003164557 0.003164557 0.003164557 0.003164557
4  7.21E-05    7.21E-05    0.000360334 7.21E-05
5  0.000804433 8.94E-05    8.94E-05    8.94E-05
6  5.63E-05    5.63E-05    5.63E-05    5.63E-05
7  0.001984127
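
With text2vec the document-topic probabilities come straight from fit_transform(): it returns a documents x topics matrix, which can be renormalised row-wise (harmless if the rows already sum to 1) and printed as a table. A minimal sketch, assuming a fitted lda_model and dtm as in the other questions on this page:

doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 1000)
doc_topic_prob = doc_topic_distr / rowSums(doc_topic_distr)  # each row sums to 1
head(round(doc_topic_prob, 6))   # one row per document, one column per topic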

Replace words in text2vec efficiently

Submitted by 风流意气都作罢 on 2019-12-06 09:51:57
Question: I have a large text body in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I struggle to find a proper (efficient) way to do this. For the later analysis I use the text2vec library and would like to use that library for this task as well (avoiding tm to reduce dependencies). An (inefficient) way would look like this:
# setup data
text <- c("my automobile is quite nice", "I like my car")
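
A token-level lookup keeps this efficient and stays inside text2vec: tokenize once, replace tokens found in a named synonym map, and feed the transformed token list to itoken(). A minimal sketch, assuming a small illustrative synonym map (names are the words to replace, values their replacements):

library(text2vec)
text <- c("my automobile is quite nice", "I like my car")
synonyms <- c(automobile = "car")

tokens <- word_tokenizer(tolower(text))
tokens <- lapply(tokens, function(x) {
  hit <- x %in% names(synonyms)
  x[hit] <- synonyms[x[hit]]
  x
})

it <- itoken(tokens, progressbar = FALSE)
# continue as usual: create_vocabulary(it), create_dtm(), ...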