text2vec

Building LDAvis plots using phrase tokens instead of single word tokens

Submitted by 孤街浪徒 on 2021-01-29 07:47:36
Question: My question is very simple: how can one build LDAvis's frequentist topic-modeling plots with phrase tokens instead of single-word tokens using the text2vec package in R? Currently the word tokenizer tokens = word_tokenizer(tokens) works great, but is there a phrase or n-gram tokenizer that would allow building LDAvis topic models and the corresponding plots with phrases instead of words? If not, how might such code be constructed? Is this even methodologically sound or advisable?
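
text2vec has no separate phrase tokenizer, but create_vocabulary() accepts an ngram argument, so ordinary word tokens can be turned into an n-gram vocabulary that feeds the usual LDA/LDAvis pipeline. A minimal sketch of that idea, assuming docs is a character vector of documents (the variable name is illustrative, not from the question):

library(text2vec)
tokens = word_tokenizer(tolower(docs))
it = itoken(tokens, progressbar = FALSE)
# keep unigrams and add bigrams; multi-word terms are joined with "_"
v = create_vocabulary(it, ngram = c(1L, 2L))
v = prune_vocabulary(v, term_count_min = 10)
dtm = create_dtm(it, vocab_vectorizer(v), type = "dgTMatrix")
lda = LDA$new(n_topics = 10L)
doc_topic = lda$fit_transform(dtm, n_iter = 1000)
lda$plot()  # the LDAvis panel now lists n-gram terms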

Get LDAvis json from text2vec

Submitted by 孤人 on 2021-01-01 13:33:11
Question: Given a document-term matrix dtm, text2vec provides a nice integration with the LDAvis package. However, I want to embed this visualisation in a markdown document. The LDAvis package has functions such as createJSON that would allow me to do this, but these are all hidden inside a private method in text2vec.
n_topics = 6
lda = LDA$new(n_topics = 6L, doc_topic_prior = 50 / n_topics, topic_word_prior = 1 / n_topics)
doc_topic_distr = lda$fit_transform(dtm, n_iter = 1000, convergence_tol = 1e
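
One way around the private plotting method is to build the JSON yourself with LDAvis::createJSON() from the model's public pieces and then embed or serve it. A rough sketch, assuming dtm, lda and doc_topic_distr from the snippet above; depending on the text2vec version you may need to renormalise the rows of doc_topic_distr so they sum to 1:

library(LDAvis)
library(Matrix)
json = createJSON(
  phi = lda$topic_word_distribution,   # topics x terms, rows sum to 1
  theta = doc_topic_distr,             # documents x topics
  doc.length = Matrix::rowSums(dtm),
  vocab = colnames(dtm),
  term.frequency = Matrix::colSums(dtm)
)
# the json string can now be written to disk or passed to serVis()/renderVis()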

Lemmatization using txt file with lemmas in R

Submitted by ↘锁芯ラ on 2020-01-13 06:42:25
Question: I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemma files for many other languages: http://www.lexiconista.com/datasets/lemmatization/):
Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
Abadan Abadanu
abadańczyk abadańczycy
abadańczyk abadańczyka
abadańczyk abadańczykach
abadańczyk abadańczykami
abadańczyk abadańczyki
abadańczyk abadańczykiem
abadańczyk abadańczykom
abadańczyk abadańczyków
abadańczyk abadańczykowi
abadańczyk abadańczyku
abadanka abadance
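
A common pattern for these two-column files (lemma, then inflected form) is to load them into a named character vector and replace each token by its lemma when a match exists. A minimal sketch, assuming the file is tab-separated and named lemmatization-pl.txt (both assumptions, not from the question):

# column 1 = lemma, column 2 = inflected form
lem = read.delim("lemmatization-pl.txt", header = FALSE,
                 stringsAsFactors = FALSE, encoding = "UTF-8")
dict = setNames(lem[[1]], lem[[2]])   # names: forms, values: lemmas

lemmatize = function(tokens) {
  hit = tokens %in% names(dict)
  tokens[hit] = dict[tokens[hit]]
  tokens
}

lemmatize(c("Abadanem", "abadańczykami", "kot"))
# -> "Abadan" "abadańczyk" "kot" (unknown tokens are left unchanged)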

LDA topic model using R text2vec package and LDAvis in shinyApp

Submitted by 对着背影说爱祢 on 2019-12-24 08:04:48
Question: Here is the code for LDA topic modelling with the R text2vec package:
library(text2vec)
tokens = docs$text %>%   # docs$text: a collection of text documents
  word_tokenizer
it = itoken(tokens, ids = docs$id, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model = text2vec::LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc
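
For the Shiny side, the LDAvis package provides visOutput() and renderVis(), so one option is to compute the JSON once (for example with LDAvis::createJSON() from the fitted model's distributions) and render it in the app. A minimal sketch, assuming a precomputed JSON string json (illustrative name, not from the question):

library(shiny)
library(LDAvis)

ui = fluidPage(
  visOutput("lda_vis")
)
server = function(input, output) {
  output$lda_vis = renderVis(json)   # json built beforehand, e.g. via createJSON()
}
shinyApp(ui, server)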

Preparing word embeddings in text2vec R package

Submitted by 放肆的年华 on 2019-12-10 11:25:54
Question: Based on the text2vec package's vignette, an example is provided for creating word embeddings. The wiki data is tokenized, a term co-occurrence matrix (TCM) is created, and that TCM is used to create the word embeddings with the GloVe function provided in the package. I want to build word embeddings for the movie review data provided with the package. My question is: do I need to collapse all the movie reviews into one long string and then do the tokenization? This will cause boundary tokens between 2
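
For what it's worth, collapsing the reviews should not be necessary: itoken() accepts a character vector with one element per review, and the co-occurrence window used by create_tcm() does not cross document boundaries. A minimal sketch on the bundled movie_review data, assuming the current GlobalVectors interface (the rank/x_max argument names may differ in older text2vec versions):

library(text2vec)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer, progressbar = FALSE)
v = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
tcm = create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5L)
glove = GlobalVectors$new(rank = 50, x_max = 10)
word_vectors = glove$fit_transform(tcm, n_iter = 10)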

A lemmatizing function using a hash dictionary does not work with tm package in R

Submitted by 佐手、 on 2019-12-09 23:45:13
Question: I would like to lemmatize Polish text using a large external dictionary (in the format shown in the txt variable below). Unfortunately, Polish is not an available option in the popular text-mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts (I have also removed Polish diacritics from both the dictionary and the corpus). Unfortunately, it does not work with the corpus format generated by tm.
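
If the lemmatizer already works on a character vector, one way to use it inside a tm pipeline is to wrap it with content_transformer() so that tm_map() can apply it document by document. A minimal sketch, assuming a lemmatize() function that maps a character vector of tokens to their lemmas (as in the linked answer) and a character vector docs of texts; both names are assumptions:

library(tm)
# hypothetical wrapper: takes one document string, returns the lemmatized string
lemmatize_text = function(x) {
  tokens = unlist(strsplit(x, "\\s+"))
  paste(lemmatize(tokens), collapse = " ")
}
corpus = VCorpus(VectorSource(docs))
corpus = tm_map(corpus, content_transformer(lemmatize_text))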

How to get topic probability table from text2vec LDA

Submitted by 拟墨画扇 on 2019-12-08 04:13:22
Question: The LDA topic modeling in the text2vec package is amazing. It is indeed much faster than topicmodels. However, I don't know how to get the probability that each document belongs to each topic, as in the example below:
   V1          V2          V3          V4
1  0.001025237 7.89E-05    7.89E-05    7.89E-05
2  0.002906977 0.002906977 0.014534884 0.002906977
3  0.003164557 0.003164557 0.003164557 0.003164557
4  7.21E-05    7.21E-05    0.000360334 7.21E-05
5  0.000804433 8.94E-05    8.94E-05    8.94E-05
6  5.63E-05    5.63E-05    5.63E-05    5.63E-05
7  0.001984127
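
With text2vec the document-topic probabilities come straight from fit_transform(): it returns a documents x topics matrix, which can be renormalised row-wise (harmless if the rows already sum to 1) and printed as a table. A minimal sketch, assuming a fitted lda_model and dtm as in the other questions on this page:

doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 1000)
doc_topic_prob = doc_topic_distr / rowSums(doc_topic_distr)  # each row sums to 1
head(round(doc_topic_prob, 6))   # one row per document, one column per topic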

Replace words in text2vec efficiently

Submitted by 风流意气都作罢 on 2019-12-06 09:51:57
Question: I have a large text body in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I struggle to find a proper (efficient) way to do this. For the later analysis I use the text2vec library and would like to use that library for this task as well (avoiding tm to reduce dependencies). An (inefficient) way would look like this:
# setup data
text <- c("my automobile is quite nice", "I like my car")
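
A token-level lookup keeps this efficient and stays inside text2vec: tokenize once, replace tokens found in a named synonym map, and feed the transformed token list to itoken(). A minimal sketch, assuming a small illustrative synonym map (names are the words to replace, values their replacements):

library(text2vec)
text <- c("my automobile is quite nice", "I like my car")
synonyms <- c(automobile = "car")

tokens <- word_tokenizer(tolower(text))
tokens <- lapply(tokens, function(x) {
  hit <- x %in% names(synonyms)
  x[hit] <- synonyms[x[hit]]
  x
})

it <- itoken(tokens, progressbar = FALSE)
# continue as usual: create_vocabulary(it), create_dtm(), ...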