Question
I am using the quanteda package, and the very good tutorials written about it, to run various operations on newspaper articles.
I obtained the frequency of specific words over time by selecting them in a dfm (mainwordsDFM), calling textstat_frequency(mainwordsDFM, group = "Date"), converting the result to a data frame, and plotting it with ggplot.
However, I am now trying to plot the frequency of a word over time and by paper.
The approach I used before does not work here, because the frequency analysis only accepts one grouping variable.
I was therefore wondering whether I could convert mainwordsDFM to a data frame, but when I do so with convert(mainwordsDFM, to = "data.frame"), the docvars stored in the dfm disappear, leaving only the counts of the selected words.
Is there a way to convert this dfm into a data frame without losing the docvars?
I want to convert the dfm because it lets me keep only specific words, whereas my original data frame (from which I built the corpus, then the tokens, then the dfm) contains the entire texts.
I doubt it is useful, but here is the dput of the head of my dfm:
new("dfm", settings = list(), weightTf = list(scheme = "count",
base = NULL, K = NULL), weightDf = list(scheme = "unary",
base = NULL, c = NULL, smoothing = NULL, threshold = NULL),
smooth = 0, ngrams = 1L, skip = 0L, concatenator = "_", version = c(1L,
5L, 2L), docvars = structure(list(Date = structure(c(9132,
9136, 9136, 9141, 9141, 9142), class = "Date"), Journal = c("Libération",
"Libération", "Libération", "Libération", "Le Monde", "La Tribune (France)"
), Titre = c("Autriche, Finlande et Suède, trois nouveaux prêts à jouer les bons élèves",
"La Suède fait ses débuts dans l'Union européenne en passant par Paris",
"1994: Année gay?", "\"\"\"\"Le Péril jeune\"\"\"\" fait table rase des années 70",
"OLYMPISME Un comité contre la discrimination des athlètes musulmanes a été créé \"\"\"\"Atlanta Plus\"\"\"\" lutte pour l'exclusion des J.O. de 1996 des délégations exclusivement masculines",
"La démonstration de force des eurodéputés"), Auteur = c("MILLOT Lorraine",
"MILLOT Lorraine", "REMES Erik", "PERON Didier", "AULAGNON MICHELE",
NA), Year = structure(c(9131, 9131, 9131, 9131, 9131, 9131
), class = "Date"), mois = structure(c(9131, 9131, 9131,
9131, 9131, 9131), class = "Date")), row.names = c("1", "2",
"3", "4", "5", "6"), class = "data.frame"), i = 2:4, p = c(0L,
1L, 2L, 3L, 3L), Dim = c(6L, 4L), Dimnames = list(docs = c("1",
"2", "3", "4", "5", "6"), features = c("sexisme", "féminisme",
"droitsdesfemmes", "égalitédessexes")), x = c(1, 2, 1), factors = list())
And here is the str:
Formal class 'dfm' [package "quanteda"] with 15 slots
..@ settings : list()
..@ weightTf :List of 3
.. ..$ scheme: chr "count"
.. ..$ base : NULL
.. ..$ K : NULL
..@ weightDf :List of 5
.. ..$ scheme : chr "unary"
.. ..$ base : NULL
.. ..$ c : NULL
.. ..$ smoothing: NULL
.. ..$ threshold: NULL
..@ smooth : num 0
..@ ngrams : int 1
..@ skip : int 0
..@ concatenator: chr "_"
..@ version : int [1:3] 1 5 2
..@ docvars :'data.frame': 16014 obs. of 6 variables:
.. ..$ Date : Date[1:16014], format: "1995-01-02" "1995-01-06" "1995-01-06" "1995-01-11" ...
.. ..$ Journal: chr [1:16014] "Libération" "Libération" "Libération" "Libération" ...
.. ..$ Titre : chr [1:16014] "Autriche, Finlande et Suède, trois nouveaux prêts à jouer les bons élèves" "La Suède fait ses débuts dans l'Union européenne en passant par Paris" "1994: Année gay?" "\"\"\"\"Le Péril jeune\"\"\"\" fait table rase des années 70" ...
.. ..$ Auteur : chr [1:16014] "MILLOT Lorraine" "MILLOT Lorraine" "REMES Erik" "PERON Didier" ...
.. ..$ Year : Date[1:16014], format: "1995-01-01" "1995-01-01" "1995-01-01" "1995-01-01" ...
.. ..$ mois : Date[1:16014], format: "1995-01-01" "1995-01-01" "1995-01-01" "1995-01-01" ...
..@ i : int [1:14822] 2 10 13 14 18 19 20 24 25 26 ...
..@ p : int [1:5] 0 2935 8389 14690 14822
..@ Dim : int [1:2] 16014 4
..@ Dimnames :List of 2
.. ..$ docs : chr [1:16014] "1" "2" "3" "4" ...
.. ..$ features: chr [1:4] "sexisme" "féminisme" "droitsdesfemmes" "égalitédessexes"
..@ x : num [1:14822] 1 2 1 1 1 1 1 1 1 1 ...
..@ factors : list()
Thank you very much, Regards
Answer 1:
Assuming your dfm is called test, you can just do:
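library(quanteda)
library(magrittr)  # provides the %>% pipe
test %>%
  convert(to = "data.frame") %>%  # one row per document, one column per feature
  cbind(docvars(test))            # append the document variables column-wise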
Or without the pipe:
cbind(convert(test, to = "data.frame"), docvars(test))
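As far as I know, this is the only way, since convert() does not extract document variables. The rows line up because docvars(test) returns one row per document, in the same order as the dfm.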
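To get from the combined data frame to a plot by date and by paper, something along these lines might work. It is only a sketch on a made-up toy corpus: the texts, the document names d1 to d4, and the choice of "sexisme" as the word to plot are assumptions for illustration, and the Date and Journal columns simply mirror the docvars from the question.

```r
library(quanteda)
library(ggplot2)

# Toy corpus; texts and docvar values are invented for illustration only
corp <- corpus(
  c(d1 = "sexisme féminisme féminisme",
    d2 = "féminisme",
    d3 = "sexisme sexisme",
    d4 = "sexisme féminisme"),
  docvars = data.frame(
    Date    = as.Date(c("1995-01-02", "1995-01-02", "1995-01-06", "1995-01-06")),
    Journal = c("Libération", "Le Monde", "Libération", "Le Monde")
  )
)
test <- dfm(tokens(corp))

# Feature counts and document variables side by side
df <- cbind(convert(test, to = "data.frame"), docvars(test))

# Sum the counts of one word for each Date/Journal combination
agg <- aggregate(sexisme ~ Date + Journal, data = df, FUN = sum)

# One line per paper over time
p <- ggplot(agg, aes(x = Date, y = sexisme, colour = Journal)) +
  geom_line() +
  geom_point()
print(p)
```

Once the docvars are ordinary columns, any number of grouping variables can go on the right-hand side of the aggregate() formula, which sidesteps the single grouping variable limit described in the question.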
Source: https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars