text2vec

From word vector to document vector [text2vec]

Posted by 大城市里の小女人 on 2019-12-06 07:28:27
Question: I'd like to use the GloVe word embeddings implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to proceed further: how do I apply or transform these word vectors and attach them to each document, so that each document is represented by a vector (derived from its component words' vectors, I assume) that can be used as input to a classifier? […]
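
A common way to close this gap is to represent each document as the count-weighted average of its words' GloVe vectors. Below is a minimal sketch against the text2vec >= 0.6 API (GlobalVectors with fit_transform); the toy texts vector and all parameter values are illustrative assumptions, not part of the original question:

    library(text2vec)

    # Hypothetical corpus for illustration
    texts <- c("the cat sat on the mat", "the dog sat on the log")
    it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
    vocab <- create_vocabulary(it)
    vectorizer <- vocab_vectorizer(vocab)

    # Fit GloVe on the term co-occurrence matrix
    tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
    glove <- GlobalVectors$new(rank = 50, x_max = 10)
    wv_main <- glove$fit_transform(tcm, n_iter = 10)
    word_vectors <- wv_main + t(glove$components)

    # Document vector = count-weighted average of its words' vectors
    dtm <- create_dtm(it, vectorizer)
    doc_vectors <- as.matrix(dtm %*% word_vectors[colnames(dtm), ]) / rowSums(dtm)

The rows of doc_vectors can then be fed to any standard classifier (e.g. glmnet or xgboost) as fixed-length feature vectors.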

A lemmatizing function using a hash dictionary does not work with the tm package in R

Posted by 孤街醉人 on 2019-12-04 20:18:40
I would like to lemmatize Polish text using a large external dictionary (in the format shown in the txt variable below). I am not lucky enough to have Polish supported by the popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) Unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

    library(hashmap)
    library(data.table)
    txt = "Abadan Abadanem Abadan […]
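
The usual fix is to wrap the vectorized lemmatizer in tm::content_transformer so that tm_map can apply it to each document's content. A minimal sketch follows, assuming the dictionary has already been read into a two-column table; the objects dict, lemma_hm, and the sample corpus are hypothetical, not from the original post:

    library(tm)
    library(hashmap)      # archived on CRAN; install from an archive if needed
    library(data.table)

    # Hypothetical dictionary: column 1 = lemma, column 2 = inflected form
    dict <- data.table(lemma = c("abadanka", "abadanka"),
                       form  = c("abadance", "abadanek"))
    lemma_hm <- hashmap(dict$form, dict$lemma)

    # Replace each token by its lemma; keep tokens missing from the dictionary
    lemmatize_doc <- function(x) {
      tokens <- unlist(strsplit(x, "\\s+"))
      lemmas <- lemma_hm[[tokens]]          # vectorized lookup, NA if absent
      lemmas[is.na(lemmas)] <- tokens[is.na(lemmas)]
      paste(lemmas, collapse = " ")
    }

    corpus <- VCorpus(VectorSource("abadance abadanek kot"))
    corpus <- tm_map(corpus, content_transformer(lemmatize_doc))

content_transformer is the key step: tm_map on a VCorpus passes PlainTextDocument objects, not character strings, so a plain character-to-character function must be wrapped before it will work.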

Lemmatization using a txt file with lemmas in R

Posted by 假装没事ソ on 2019-12-04 19:47:56
I would like to use an external txt file with Polish lemmas, structured as follows (lemma lists for many other languages are available at http://www.lexiconista.com/datasets/lemmatization/):

    Abadan Abadanem
    Abadan Abadanie
    Abadan Abadanowi
    Abadan Abadanu
    abadańczyk abadańczycy
    abadańczyk abadańczyka
    abadańczyk abadańczykach
    abadańczyk abadańczykami
    abadańczyk abadańczyki
    abadańczyk abadańczykiem
    abadańczyk abadańczykom
    abadańczyk abadańczyków
    abadańczyk abadańczykowi
    abadańczyk abadańczyku
    abadanka abadance
    abadanka abadanek
    abadanka abadanką
    abadanka abadankach
    abadanka abadankami

What packages and with […]
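
One straightforward approach, sketched under stated assumptions: read the two-column file with data.table::fread and build a named character vector mapping each inflected form to its lemma. The filename lemmatization-pl.txt and the sample tokens are hypothetical:

    library(data.table)

    # Hypothetical filename; each line is "lemma <tab> inflected form"
    dict <- fread("lemmatization-pl.txt", header = FALSE,
                  col.names = c("lemma", "form"), encoding = "UTF-8")

    # Named vector: names are inflected forms, values are lemmas
    lookup <- setNames(dict$lemma, dict$form)

    lemmatize_tokens <- function(tokens) {
      lemmas <- lookup[tokens]
      ifelse(is.na(lemmas), tokens, lemmas)   # keep unknown tokens unchanged
    }

    lemmatize_tokens(c("Abadanem", "abadance", "kot"))
    # expected: "Abadan" "abadanka" "kot"

A named-vector lookup is fully vectorized, so it scales well even when the dictionary has millions of form/lemma pairs.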

Really fast word ngram vectorization in R

Posted by Deadly on 2019-11-28 18:25:14
edit: The new package text2vec is excellent, and solves this problem (and many others) really well: text2vec on CRAN, text2vec on github, and a vignette that illustrates ngram tokenization.

I have a pretty large text dataset in R, which I've imported as a character vector:

    # Takes about 15 seconds
    system.time({
      set.seed(1)
      samplefun <- function(n, x, collapse) {
        paste(sample(x, n, replace = TRUE), collapse = collapse)
      }
      words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
      sents1 <- sapply(rpois(1000000, 5) + 1, samplefun, words, ' ')
    })

I can convert this character data to a bag-of-words […]
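
For reference, a minimal sketch of the ngram vectorization the edit alludes to, using the sents1 vector from the snippet above; the ngram bounds and hash size are illustrative assumptions:

    library(text2vec)

    it <- itoken(sents1, tokenizer = word_tokenizer, progressbar = FALSE)

    # Vocabulary-based unigram + bigram document-term matrix
    vocab <- create_vocabulary(it, ngram = c(1L, 2L))
    dtm <- create_dtm(it, vocab_vectorizer(vocab))   # sparse dgCMatrix

    # Or skip the vocabulary pass entirely with feature hashing
    dtm_hashed <- create_dtm(it, hash_vectorizer(hash_size = 2^18, ngram = c(1L, 2L)))

The hashed variant avoids building and storing a vocabulary, which is the usual choice when raw throughput on very large corpora matters more than interpretable column names.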
