quanteda

Keyword in context (kwic) for skipgrams?

て烟熏妆下的殇ゞ submitted on 2020-12-12 02:02:34
Question: I do keyword-in-context analysis with quanteda for ngrams and tokens and it works well. I now want to do it for skipgrams: capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows". x <- tokens("barriers entry") ntoken_test <- tokens_ngrams(x, n = 2,
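A hedged sketch of one way to get kwic to match non-adjacent phrases: build skipgram tokens with tokens_skipgrams() and run kwic() on those, matching the concatenated form as a single token. The sample sentence and the skip/window values here are illustrative assumptions, not taken from the question.

```r
library(quanteda)

txt <- "High barriers to entry and barriers to fast entry deter startups."
toks <- tokens(txt, remove_punct = TRUE)

# trigrams allowing up to 2 skipped tokens, joined with spaces
skips <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")

# each skipgram is now a single token, so match it as a fixed pattern
kwic(skips, pattern = "barriers to entry", valuetype = "fixed", window = 3)
```

Because skipgrams over "barriers to fast entry" include the variant with "fast" skipped, both the adjacent and the interrupted phrase should surface as "barriers to entry" tokens.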

R: weighted inverse document frequency (tfidf) similarity between strings

為{幸葍}努か submitted on 2020-08-09 08:17:12
Question: I want to be able to find the similarity between two strings, weighting each token (word) by its inverse document frequency (those frequencies are not taken from those strings). Using quanteda I can create a dfm_tfidf with inverse document frequency weights, but I do not know how to proceed after that. Sample data: ss = c( "ibm madrid limited research", "madrid limited research", "limited research", "research" ) counts = list(ibm = 1, madrid = 2, limited = 3, research = 4) cor = corpus(long_list_of
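A minimal sketch of one way to proceed after the weighting step, assuming the goal is pairwise cosine similarity between the tf-idf-weighted documents. Note that textstat_simil() moved to the quanteda.textstats package in quanteda 3.x; this sketch computes IDF from the strings themselves, whereas the question wants external document frequencies, so the weighting source would need adapting.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here since quanteda 3.x

ss <- c("ibm madrid limited research", "madrid limited research",
        "limited research", "research")

d <- dfm(tokens(ss))
d_tfidf <- dfm_tfidf(d)  # inverse-document-frequency weighting

# pairwise cosine similarity between the weighted documents
textstat_simil(d_tfidf, method = "cosine", margin = "documents")
```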

Merge two dataframe by rows using common words [duplicate]

北战南征 submitted on 2020-07-15 08:32:08
Question: This question already has answers here: dplyr: inner_join with a partial string match (4 answers). Closed 9 days ago. df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312)) df2 <- data.frame(freetext = c("open until night", "one time to insert your be"), aid = c(3,5)) I would like to merge the two dataframes using the freetext column as the by option. However, the text is not exactly the same, as some words are removed or altered. Is there
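A hedged sketch of a base-R approach to this kind of approximate merge: score every pair of rows by word-overlap (Jaccard similarity) and join each row of df1 to its best match in df2. The jaccard helper and the best-match strategy are assumptions, not part of the linked answers.

```r
df1 <- data.frame(freetext = c("open until monday night",
                               "one more time to insert your coin"),
                  numid = c(291, 312))
df2 <- data.frame(freetext = c("open until night",
                               "one time to insert your be"),
                  aid = c(3, 5))

# word-level Jaccard similarity between two strings
jaccard <- function(a, b) {
  wa <- strsplit(a, "\\s+")[[1]]
  wb <- strsplit(b, "\\s+")[[1]]
  length(intersect(wa, wb)) / length(union(wa, wb))
}

# for each row of df1, index of the most similar row of df2
best <- sapply(df1$freetext,
               function(a) which.max(sapply(df2$freetext, jaccard, a = a)))

cbind(df1, df2[best, , drop = FALSE])
```

For larger data, a package such as fuzzyjoin (stringdist_inner_join) covers the same idea with edit-distance thresholds instead of word overlap.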

How to do fuzzy pattern matching with quanteda and kwic?

放肆的年华 submitted on 2020-06-27 15:08:09
Question: I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic function in the quanteda package: kwic(dataset, pattern = "suicidal", window = 5) So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for three deviating characters, with no restriction on where in the word
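A hedged sketch of one workaround: kwic() itself has no edit-distance matching, but base R's agrep() can collect every token type within a given edit distance of the target, and that vector of types can then be passed to kwic() as fixed patterns. The sample text, the deliberate typo, and the distance value are assumptions.

```r
library(quanteda)

toks <- tokens("The patient was sucidal and expressed suicidal ideation repeatedly.")

# all token types within edit distance 3 of the target word
target <- "suicidal"
fuzzy_types <- agrep(target, types(toks), max.distance = 3, value = TRUE)

# kwic over the exact fuzzy variants found in the corpus
kwic(toks, pattern = fuzzy_types, valuetype = "fixed", window = 5)
```

For large corpora, running agrep() once over types(toks) (the unique vocabulary) is far cheaper than fuzzy-matching every token occurrence.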

How to initialize second glove model with solution from first?

一世执手 submitted on 2020-05-30 03:38:38
Question: I am trying to implement one of the solutions to the question about How to align two GloVe models in text2vec?. I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure the values for w_i and w_j are correct? Here's a minimal reproducible example. First, prepare some corpora to compare, taken from the quanteda tutorial. I am using dfm_match(all_words) to try and ensure all words are present in each set, but this doesn't seem to
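A hedged sketch of where w_i and w_j come from in the standard text2vec API: the main (word) vectors are returned by fit_transform(), and the context vectors sit in the model's $components field. Whether $new() accepts an init argument depends on the text2vec version the linked answer refers to; the toy corpus below is an assumption and only shows how to obtain the two matrices from a first model.

```r
library(text2vec)

txt <- c("the cat sat on the mat", "the dog sat on the log")
it <- itoken(txt, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 3)

glove <- GlobalVectors$new(rank = 10, x_max = 5)
w_i <- glove$fit_transform(tcm, n_iter = 5)  # main (word) vectors
w_j <- t(glove$components)                   # context vectors, same orientation
```

Both matrices are vocabulary-by-rank; initializing a second model with them only makes sense if both corpora share the same vocabulary ordering, which is presumably what the dfm_match(all_words) step is meant to guarantee.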

R: Quanteda's textstat_simil function

倾然丶 夕夏残阳落幕 submitted on 2020-03-21 06:41:05
Question: I am using quanteda's textstat_simil to compute semantic relatedness in a text corpus. The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html Here is a running example, and it works fine: # compute term similarities pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english")) (s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features")) head(as.matrix(s1), 10) as.list(s1, n = 8) I
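A hedged sketch of the same feature-similarity computation written against quanteda 3.x, where dfm() no longer accepts remove_punct/remove directly on a corpus and textstat_simil() lives in quanteda.textstats; the version split is an assumption about the reader's setup.

```r
library(quanteda)
library(quanteda.textstats)

pres_dfm <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("english")) |>
  dfm()

# similarity of every feature to the three selected features
s1 <- textstat_simil(pres_dfm, pres_dfm[, c("fair", "health", "terror")],
                     method = "cosine", margin = "features")
head(as.matrix(s1), 10)
```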

How to do named entity recognition (NER) using quanteda?

妖精的绣舞 submitted on 2020-03-20 20:58:38
Question: Having a dataframe with text: df = data.frame(id = c(1,2), text = c("My best friend John works and Google", "However he would like to work at Amazon as he likes to use python and stay at Canada")) Without any preprocessing, how is it possible to extract named entities like this? Expected result: dfresults = data.frame(id = c(1,2), ner_words = c("John, Google", "Amazon, python, Canada")) Answer 1: You can do this without quanteda, using the spacyr package -- a wrapper around the spaCy library
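A minimal sketch of the spacyr route the answer points to. It assumes spaCy and an English model are installed on the system, since spacyr wraps a local Python spaCy installation.

```r
library(spacyr)
spacy_initialize()  # assumes spaCy + an English model are installed

df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works and Google",
                          "However he would like to work at Amazon as he likes to use python and stay at Canada"))

# parse with entity annotation, then pull out the named entities
parsed <- spacy_parse(df$text, entity = TRUE)
entity_extract(parsed)  # one row per entity, with doc_id and entity type
```

The doc_id column in the result maps entities back to the rows of df, from which a comma-separated ner_words column can be aggregated.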


Argument ngrams not used

不羁岁月 submitted on 2020-01-24 20:56:06
Question: I use quanteda for text analysis, with these commands: corp_df2 <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_remove(pattern = stopwords(source = "smart")) %>% tokens_wordstem() corp_df3 <- dfm(corp_df2) %>% dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile") myDfm <- dfm(corp_df3, ngrams = c(1,3)) But I receive this error: "Argument ngrams not used." How can I use the command to receive ngrams? Source: https://stackoverflow.com/questions
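A sketch of the likely fix: in quanteda 2.x and later, dfm() no longer accepts an ngrams argument, so ngrams must be formed at the tokens stage with tokens_ngrams() before the dfm is built. The sample text is an assumption standing in for df$text.

```r
library(quanteda)

txt <- c("barriers to entry remain high in regulated markets")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE) |>
  tokens_remove(pattern = stopwords(source = "smart")) |>
  tokens_wordstem() |>
  tokens_ngrams(n = 1:3)  # unigrams through trigrams, formed on tokens

myDfm <- dfm(toks)
```

Any dfm_trim() step would then come after this dfm() call, since the ngram features must exist before they can be trimmed.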

Split up ngrams in document-feature matrix (quanteda)

China☆狼群 submitted on 2020-01-17 06:09:26
Question: I was wondering if it's possible to split up ngram features in a document-feature matrix (dfm) in such a way that e.g. a bigram results in two separate unigrams?

head(dfm, n = 3, nfeature = 4)
docs      in_the great plenary emission_reduction
10752099       3     1       1                  3
10165509       8     0       0                  3
10479890       4     0       0                  1

So, the above dfm would result in something like this:

head(dfm, n = 3, nfeature = 4)
docs      in great plenary emission the reduction
10752099   3     1       1        3   3         3
10165509   8     0       0        3   8         3
10479890   4     0       0        1   4         1

For better
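A hedged sketch of one way to do this outside quanteda's own API: split each compound feature name on the concatenator, repeat that column's counts once per unigram part, then sum columns that end up sharing a name. The toy dfm and the "_" concatenator are assumptions matching quanteda's default.

```r
library(quanteda)

d <- dfm(tokens_ngrams(tokens("in the great plenary the emission reduction"),
                       n = 1:2))

feats <- featnames(d)
parts <- strsplit(feats, "_", fixed = TRUE)

# repeat each ngram column once per unigram it contains
m <- as.matrix(d)
expanded <- m[, rep(seq_along(feats), lengths(parts)), drop = FALSE]
colnames(expanded) <- unlist(parts)

# sum columns that now share the same unigram name
uni <- t(rowsum(t(expanded), group = colnames(expanded)))
uni
```

Note this double-counts words that appear both as unigram features and inside bigrams, which matches the expected output shown above (e.g. "in" inherits the counts of "in_the").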