quanteda

Keyword in context (kwic) for skipgrams?

て烟熏妆下的殇ゞ submitted on 2020-12-12 02:02:34
Question: I do keyword-in-context analysis with quanteda for ngrams and tokens and it works well. I now want to do it for skipgrams: capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows". x <- tokens("barriers entry") ntoken_test <- tokens_ngrams(x, n = 2,
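A hedged sketch of one way to get kwic to match non-adjacent phrases: build skipgram tokens with tokens_skipgrams() and run kwic() on those, matching the concatenated form as a single token. The sample sentence and the skip/window values here are illustrative assumptions, not taken from the question.

```r
library(quanteda)

txt <- "High barriers to entry and barriers to fast entry deter startups."
toks <- tokens(txt, remove_punct = TRUE)

# trigrams allowing up to 2 skipped tokens, joined with spaces
skips <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")

# each skipgram is now a single token, so match it as a fixed pattern
kwic(skips, pattern = "barriers to entry", valuetype = "fixed", window = 3)
```

Because skipgrams over "barriers to fast entry" include the variant with "fast" skipped, both the adjacent and the interrupted phrase should surface as "barriers to entry" tokens.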

R: weighted inverse document frequency (tfidf) similarity between strings

為{幸葍}努か submitted on 2020-08-09 08:17:12
Question: I want to be able to find the similarity between two strings, weighting each token (word) by its inverse document frequency (those frequencies are not taken from those strings). Using quanteda I can create a dfm_tfidf with inverse document frequency weights, but I do not know how to proceed after that. Sample data: ss = c( "ibm madrid limited research", "madrid limited research", "limited research", "research" ) counts = list(ibm = 1, madrid = 2, limited = 3, research = 4) cor = corpus(long_list_of
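A minimal sketch of one way to proceed after the weighting step, assuming the goal is pairwise cosine similarity between the tf-idf-weighted documents. Note that textstat_simil() moved to the quanteda.textstats package in quanteda 3.x; this sketch computes IDF from the strings themselves, whereas the question wants external document frequencies, so the weighting source would need adapting.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here since quanteda 3.x

ss <- c("ibm madrid limited research", "madrid limited research",
        "limited research", "research")

d <- dfm(tokens(ss))
d_tfidf <- dfm_tfidf(d)  # inverse-document-frequency weighting

# pairwise cosine similarity between the weighted documents
textstat_simil(d_tfidf, method = "cosine", margin = "documents")
```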

Merge two dataframe by rows using common words [duplicate]

北战南征 submitted on 2020-07-15 08:32:08
Question: This question already has answers here: dplyr: inner_join with a partial string match (4 answers). Closed 9 days ago. df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312)) df2 <- data.frame(freetext = c("open until night", "one time to insert your be"), aid = c(3,5)) I would like to merge the two dataframes using the freetext column as the by option. However, the text is not exactly the same, as some words are removed or altered. Is there
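A hedged sketch of a base-R approach to this kind of approximate merge: score every pair of rows by word-overlap (Jaccard similarity) and join each row of df1 to its best match in df2. The jaccard helper and the best-match strategy are assumptions, not part of the linked answers.

```r
df1 <- data.frame(freetext = c("open until monday night",
                               "one more time to insert your coin"),
                  numid = c(291, 312))
df2 <- data.frame(freetext = c("open until night",
                               "one time to insert your be"),
                  aid = c(3, 5))

# word-level Jaccard similarity between two strings
jaccard <- function(a, b) {
  wa <- strsplit(a, "\\s+")[[1]]
  wb <- strsplit(b, "\\s+")[[1]]
  length(intersect(wa, wb)) / length(union(wa, wb))
}

# for each row of df1, index of the most similar row of df2
best <- sapply(df1$freetext,
               function(a) which.max(sapply(df2$freetext, jaccard, a = a)))

cbind(df1, df2[best, , drop = FALSE])
```

For larger data, a package such as fuzzyjoin (stringdist_inner_join) covers the same idea with edit-distance thresholds instead of word overlap.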

How to do fuzzy pattern matching with quanteda and kwic?

放肆的年华 submitted on 2020-06-27 15:08:09
Question: I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic function in the quanteda package: kwic(dataset, pattern = "suicidal", window = 5) So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for three deviating characters, with no restriction on where in the word
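A hedged sketch of one workaround: kwic() itself has no edit-distance matching, but base R's agrep() can collect every token type within a given edit distance of the target, and that vector of types can then be passed to kwic() as fixed patterns. The sample text, the deliberate typo, and the distance value are assumptions.

```r
library(quanteda)

toks <- tokens("The patient was sucidal and expressed suicidal ideation repeatedly.")

# all token types within edit distance 3 of the target word
target <- "suicidal"
fuzzy_types <- agrep(target, types(toks), max.distance = 3, value = TRUE)

# kwic over the exact fuzzy variants found in the corpus
kwic(toks, pattern = fuzzy_types, valuetype = "fixed", window = 5)
```

For large corpora, running agrep() once over types(toks) (the unique vocabulary) is far cheaper than fuzzy-matching every token occurrence.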

How to initialize second glove model with solution from first?

一世执手 submitted on 2020-05-30 03:38:38
Question: I am trying to implement one of the solutions to the question about How to align two GloVe models in text2vec?. I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure the values for w_i and w_j are correct? Here's a minimal reproducible example. First, prepare some corpora to compare, taken from the quanteda tutorial. I am using dfm_match(all_words) to try and ensure all words are present in each set, but this doesn't seem to
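A hedged sketch of where w_i and w_j come from in the standard text2vec API: the main (word) vectors are returned by fit_transform(), and the context vectors sit in the model's $components field. Whether $new() accepts an init argument depends on the text2vec version the linked answer refers to; the toy corpus below is an assumption and only shows how to obtain the two matrices from a first model.

```r
library(text2vec)

txt <- c("the cat sat on the mat", "the dog sat on the log")
it <- itoken(txt, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 3)

glove <- GlobalVectors$new(rank = 10, x_max = 5)
w_i <- glove$fit_transform(tcm, n_iter = 5)  # main (word) vectors
w_j <- t(glove$components)                   # context vectors, same orientation
```

Both matrices are vocabulary-by-rank; initializing a second model with them only makes sense if both corpora share the same vocabulary ordering, which is presumably what the dfm_match(all_words) step is meant to guarantee.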

R: Quanteda's textstat_simil function

倾然丶 夕夏残阳落幕 submitted on 2020-03-21 06:41:05
Question: I am using quanteda's textstat_simil to compute semantic relatedness in a text corpus. The use of this function is explained here: https://rdrr.io/cran/quanteda/man/textstat_simil.html Here is a running example, and it works fine: # compute term similarities pres_dfm <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english")) (s1 <- textstat_simil(pres_dfm, c("fair", "health", "terror"), method = "cosine", margin = "features")) head(as.matrix(s1), 10) as.list(s1, n = 8) I
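A hedged sketch of the same feature-similarity computation written against quanteda 3.x, where dfm() no longer accepts remove_punct/remove directly on a corpus and textstat_simil() lives in quanteda.textstats; the version split is an assumption about the reader's setup.

```r
library(quanteda)
library(quanteda.textstats)

pres_dfm <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("english")) |>
  dfm()

# similarity of every feature to the three selected features
s1 <- textstat_simil(pres_dfm, pres_dfm[, c("fair", "health", "terror")],
                     method = "cosine", margin = "features")
head(as.matrix(s1), 10)
```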

How to do named entity recognition (NER) using quanteda?

妖精的绣舞 submitted on 2020-03-20 20:58:38
Question: Having a dataframe with text: df = data.frame(id = c(1,2), text = c("My best friend John works and Google", "However he would like to work at Amazon as he likes to use python and stay at Canada")) Without any preprocessing, how is it possible to extract named entities like this? Expected result: dfresults = data.frame(id = c(1,2), ner_words = c("John, Google", "Amazon, python, Canada")) Answer 1: You can do this without quanteda, using the spacyr package -- a wrapper around the spaCy library
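A minimal sketch of the spacyr route the answer points to. It assumes spaCy and an English model are installed on the system, since spacyr wraps a local Python spaCy installation.

```r
library(spacyr)
spacy_initialize()  # assumes spaCy + an English model are installed

df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works and Google",
                          "However he would like to work at Amazon as he likes to use python and stay at Canada"))

# parse with entity annotation, then pull out the named entities
parsed <- spacy_parse(df$text, entity = TRUE)
entity_extract(parsed)  # one row per entity, with doc_id and entity type
```

The doc_id column in the result maps entities back to the rows of df, from which a comma-separated ner_words column can be aggregated.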


Argument ngrams not used

不羁岁月 submitted on 2020-01-24 20:56:06
Question: I use quanteda for text analysis, with these commands: corp_df2 <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_remove(pattern = stopwords(source = "smart")) %>% tokens_wordstem() corp_df3 <- dfm(corp_df2) %>% dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile") myDfm <- dfm(corp_df3, ngrams = c(1,3)) But I receive this error: "Argument ngrams not used." How can I use the command to receive ngrams? Source: https://stackoverflow.com/questions
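A sketch of the likely fix: in quanteda 2.x and later, dfm() no longer accepts an ngrams argument, so ngrams must be formed at the tokens stage with tokens_ngrams() before the dfm is built. The sample text is an assumption standing in for df$text.

```r
library(quanteda)

txt <- c("barriers to entry remain high in regulated markets")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE) |>
  tokens_remove(pattern = stopwords(source = "smart")) |>
  tokens_wordstem() |>
  tokens_ngrams(n = 1:3)  # unigrams through trigrams, formed on tokens

myDfm <- dfm(toks)
```

Any dfm_trim() step would then come after this dfm() call, since the ngram features must exist before they can be trimmed.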

Split up ngrams in document-feature matrix (quanteda)

China☆狼群 submitted on 2020-01-17 06:09:26
Question: I was wondering if it's possible to split up ngram features in a document-feature matrix (dfm) in such a way that e.g. a bigram results in two separate unigrams?

head(dfm, n = 3, nfeature = 4)
docs      in_the great plenary emission_reduction
10752099       3     1       1                  3
10165509       8     0       0                  3
10479890       4     0       0                  1

So, the above dfm would result in something like this:

head(dfm, n = 3, nfeature = 4)
docs      in great plenary emission the reduction
10752099   3     1       1        3   3         3
10165509   8     0       0        3   8         3
10479890   4     0       0        1   4         1

For better
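A hedged sketch of one way to do this outside quanteda's own API: split each compound feature name on the concatenator, repeat that column's counts once per unigram part, then sum columns that end up sharing a name. The toy dfm and the "_" concatenator are assumptions matching quanteda's default.

```r
library(quanteda)

d <- dfm(tokens_ngrams(tokens("in the great plenary the emission reduction"),
                       n = 1:2))

feats <- featnames(d)
parts <- strsplit(feats, "_", fixed = TRUE)

# repeat each ngram column once per unigram it contains
m <- as.matrix(d)
expanded <- m[, rep(seq_along(feats), lengths(parts)), drop = FALSE]
colnames(expanded) <- unlist(parts)

# sum columns that now share the same unigram name
uni <- t(rowsum(t(expanded), group = colnames(expanded)))
uni
```

Note this double-counts words that appear both as unigram features and inside bigrams, which matches the expected output shown above (e.g. "in" inherits the counts of "in_the").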