Question
I have a diachronic corpus with texts for different organizations, each for years 1969 to 2019. For each organization, I want to compare text for year 1969 and text for 1970, 1970 and 1971, etc. Texts for some years are missing.
In other words,
I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:
ncsimil <- textstat_simil(dfm.cc,
                          y = NULL,
                          selection = NULL,
                          margin = "documents",
                          method = "jaccard",
                          min_simil = NULL)
This compares every text with every other text, resulting in 2.6+ million lines. I really only need to compare certain texts with the text immediately above, like this:
TextA
TextB
TextC
TextD (has NA)
TextE
So, I want the Jaccard statistic for A and B, B and C, and (since some have NA values) D and E.
I am curious about the y = argument in textstat_simil. The quanteda documentation says:
"y is an optional target matrix matching x in the margin on which the similarity or distance will be computed."
It is not clear to me what this means.
Does it mean I can create two different data frames
A
B
C
D
E
and
B
C
D
E
F
So that I will get a similarity statistic for
A and B
B and C
and so forth?
Or is there a better way to do this?
Edited starting here... I converted to a data.frame:
df <- convert(dfm.cc, to = "data.frame")
I did bind_cols to add docvars and token counts (2,405 columns -- short texts).
I have isolated the initial texts in a series, e.g.,
OrgA 1970, 1st_in_Series_Yes, TokCount 1...etc.
OrgA 1971, 1st_in_Series_No, TokCount 1...etc.
OrgA 1972, 1st_in_Series_No, TokCount 1...etc.
OrgA 1973, NA
OrgA 1974, 1st_in_Series_Yes, TokCount 1...etc.
OrgZ 1975, 1st_in_Series_No, TokCount 1...etc.
So as not to compare
OrgA 1973 NA with OrgA 1972
or
OrgA 1974 with OrgA 1973
Manually computing Jaccard should work from here, but there's probably a smarter way. Please share solutions. Thanks.
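For reference, the manual route can be sketched in base R. This assumes the set-based definition of Jaccard similarity (shared word types divided by all word types across the two documents). The function name jaccard and the two toy documents are made up for illustration and are not taken from the corpus above.

```r
# Set-based Jaccard similarity between two character vectors of word types
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical tokenized texts for two adjacent years
doc_1970 <- c("annual", "report", "on", "trade")
doc_1971 <- c("annual", "report", "on", "labor", "policy")

# 3 shared types out of 6 distinct types overall
jaccard(doc_1970, doc_1971)
## [1] 0.5
```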
Answer 1:
Interesting question. I don't have a reproducible example to work with, but I think I can create one using the built-in inaugural corpus dataset. Here, I will use the document variable Year for the time variable, and the unique (full) name of each president as an analog for your organization (since you don't want year-to-year comparisons across different organizations). So if you substitute your organization and time variables for the ones below, this should work.
Note that I make the outer "loop" an lapply, and the inner is an actual loop, but there are clever ways to make the inner part also an lapply. Here I've left it as a for loop for simplicity.
First, get a unique name, since some (different) presidents share the same last names.
library("quanteda")
## Package version: 2.0.1
data_corpus_inaugural$president <- paste(data_corpus_inaugural$President,
  data_corpus_inaugural$FirstName,
  sep = ", "
)
head(data_corpus_inaugural$president, 10)
## [1] "Washington, George" "Washington, George" "Adams, John"
## [4] "Jefferson, Thomas" "Jefferson, Thomas" "Madison, James"
## [7] "Madison, James" "Monroe, James" "Monroe, James"
## [10] "Adams, John Quincy"
If we take the unique values of that variable, we can iterate across the presidents and subset them one at a time. (This is what you would do with each of your organizations.) We can do this using corpus_subset() before creating the dfm and, within each subset, select just the adjacent year pairs. Sorting the years means that elements i and i+1 are adjacent. Most of the presidents who appear more than once have only one pair of years, but Franklin Roosevelt, who gave four inaugural addresses, has three pairs. Single-term presidents, such as Carter (1977), have no pairs.
simpairs <- lapply(unique(data_corpus_inaugural$president), function(x) {
  # dfm for the documents of one president only
  dfmat <- corpus_subset(data_corpus_inaugural, president == x) %>%
    dfm(remove_punct = TRUE)
  df <- data.frame()
  years <- sort(dfmat$Year)
  # walk over adjacent pairs (i, i + 1); skipped when there is only one year
  for (i in seq_along(years)[-length(years)]) {
    sim <- textstat_simil(
      dfm_subset(dfmat, Year %in% c(years[i], years[i + 1])),
      method = "jaccard"
    )
    # one row per pair: document1, document2, jaccard
    df <- rbind(df, as.data.frame(sim))
  }
  df
})
Now when we join them, you can see that we have computed only what we need.
do.call(rbind, simpairs)
## document1 document2 jaccard
## 1 1789-Washington 1793-Washington 0.09250399
## 2 1801-Jefferson 1805-Jefferson 0.20512821
## 3 1809-Madison 1813-Madison 0.20138889
## 4 1817-Monroe 1821-Monroe 0.29436202
## 5 1829-Jackson 1833-Jackson 0.20693928
## 6 1861-Lincoln 1865-Lincoln 0.14055885
## 7 1869-Grant 1873-Grant 0.20981595
## 8 1885-Cleveland 1893-Cleveland 0.23037543
## 9 1897-McKinley 1901-McKinley 0.25031211
## 10 1913-Wilson 1917-Wilson 0.21285564
## 11 1933-Roosevelt 1937-Roosevelt 0.20956522
## 12 1937-Roosevelt 1941-Roosevelt 0.20081549
## 13 1941-Roosevelt 1945-Roosevelt 0.18740157
## 14 1953-Eisenhower 1957-Eisenhower 0.21566976
## 15 1969-Nixon 1973-Nixon 0.23451777
## 16 1981-Reagan 1985-Reagan 0.24381368
## 17 1993-Clinton 1997-Clinton 0.24199623
## 18 2001-Bush 2005-Bush 0.24170616
## 19 2009-Obama 2013-Obama 0.24739195
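Note that the loop pairs consecutive documents for each president regardless of how far apart they are in time (Cleveland 1885 and 1893 are paired above). For yearly data with missing years, one option is to keep only pairs that are exactly one year apart. A minimal base R sketch of that check, using hypothetical years for a single organization (the names years, idx and pairs are illustrative):

```r
# Hypothetical years with a text available for one organization (1973 missing)
years <- sort(c(1970, 1971, 1972, 1974, 1975))

# Positions i where the next available year is exactly one year later
idx <- which(diff(years) == 1)

# Adjacent-year pairs to compare; 1972-1974 is dropped because of the gap
pairs <- data.frame(year1 = years[idx], year2 = years[idx + 1])
pairs
```

Inside the loop above, the same idea would amount to skipping any i for which years[i + 1] - years[i] is not 1.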
For computing similarity you might want to add more options to the dfm creation line - I only removed punctuation here but you could also remove stopwords, numbers, etc. if that is what you want.
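As a rough sketch of what such extra preprocessing might look like, the following tokenizes first and then builds the dfm. This is also the route quanteda 3 and later expect, since passing remove_* arguments directly to dfm() is deprecated there. The object names toks and dfmat_clean are illustrative, and the choice of English stopwords is an assumption about your texts.

```r
library("quanteda")

# Tokenize, dropping punctuation and numbers
toks <- tokens(data_corpus_inaugural,
               remove_punct = TRUE,
               remove_numbers = TRUE)

# Remove English stopwords (matching is case-insensitive by default)
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix from the cleaned tokens
dfmat_clean <- dfm(toks)
```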
Source: https://stackoverflow.com/questions/61626262/how-to-compute-similarity-in-quanteda-between-documents-for-adjacent-years-only