quanteda | 易学教程

How to compute similarity in quanteda between documents for adjacent years only, within groups?

Question: I have a diachronic corpus with texts for different organizations, each covering the years 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, 1970 with 1971, and so on. Texts for some years are missing. In other words, I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil: ncsimil <- textstat_simil(dfm.cc, y = NULL, selection = NULL, margin = "documents", method = "jaccard", min_simil = NULL) This compares every text with every other text,

Feature selection in document-feature matrix by using chi-squared test

Question: I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know that many people have already asked this question. However, I couldn't find the relevant code. (The answers only gave a brief outline of the concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r) I learned that I could use

Keep the word frequency and inverse for one type of documents

Question: Code example to keep the term frequency and inverse document frequency: library(dplyr) library(janeaustenr) library(tidytext) book_words <- austen_books() %>% unnest_tokens(word, text) %>% count(book, word, sort = TRUE) total_words <- book_words %>% group_by(book) %>% summarize(total = sum(n)) book_words <- left_join(book_words, total_words) book_words <- book_words %>% bind_tf_idf(word, book, n) book_words %>% select(-total) %>% arrange(desc(tf_idf)) My problem is that this example uses multiple books. I have

How to convert DFM into dataframe BUT keeping docvars?

Question: I am using the quanteda package, and the very good tutorials that have been written about it, to perform various operations on paper articles. I obtained the frequency of specific words over time by selecting them in a mainwordsDFM and using textstat_frequency(mainwordsDFM, group = "Date"), then converted the result into a data frame and plotted it with ggplot. However, I am now trying to plot the frequency of a word over time and by paper. The solution I used for my previous operation does not work in

Is it possible to use `kwic` function to find words near to each other?

Question: I found this reference: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch05s07.html Is it possible to use it with the kwic function in the quanteda package to find documents in a corpus containing words that are not "stuck" together but close to each other, with perhaps a few other words in between? For example, if I give two words to the function, I would like to find the documents in a corpus where these two words occur but maybe with some words between

Keyword in context (kwic) for skipgrams?

Question: I do keyword-in-context analysis with quanteda for ngrams and tokens, and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code returns a kwic object that is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows" x <- tokens("barriers entry") ntoken_test <- tokens_ngrams(x, n = 2,
