How to convert DFM into dataframe BUT keeping docvars?

青春壹個敷衍的年華, submitted on 2021-01-28 22:16:32

Question


I am using the quanteda package, along with the very good tutorials written about it, to run various operations on newspaper articles. I obtained the frequency of specific words over time by selecting them in mainwordsDFM and using textstat_frequency(mainwordsDFM, group = "Date"), then converted the result into a data frame and plotted it with ggplot. However, I am now trying to plot the frequency of a word over time and by paper. The solution I used previously does not work in this case, because only one variable can be used to group the result of the frequency analysis.

I was therefore wondering whether it is possible to convert mainwordsDFM to a data frame. But when I do so with convert(mainwordsDFM, to = "data.frame"), the docvars included in the dfm disappear, leaving only the occurrences of the selected words.

Is there a way to convert this dfm into a data frame without losing the docvars?
As you may have understood, I am interested in converting the dfm because it lets me keep only specific words, whereas my original data frame (from which I built the corpus, then the tokens, then the dfm) contained entire texts.

I doubt it is useful, but here is the dput of the head of my dfm:

new("dfm", settings = list(), weightTf = list(scheme = "count", 
    base = NULL, K = NULL), weightDf = list(scheme = "unary", 
    base = NULL, c = NULL, smoothing = NULL, threshold = NULL), 
    smooth = 0, ngrams = 1L, skip = 0L, concatenator = "_", version = c(1L, 
    5L, 2L), docvars = structure(list(Date = structure(c(9132, 
    9136, 9136, 9141, 9141, 9142), class = "Date"), Journal = c("Libération", 
    "Libération", "Libération", "Libération", "Le Monde", "La Tribune (France)"
    ), Titre = c("Autriche, Finlande et Suède, trois nouveaux prêts à jouer les bons élèves", 
    "La Suède fait ses débuts dans l'Union européenne en passant par Paris", 
    "1994: Année gay?", "\"\"\"\"Le Péril jeune\"\"\"\" fait table rase des années 70", 
    "OLYMPISME   Un comité contre la discrimination des athlètes musulmanes a été créé  \"\"\"\"Atlanta Plus\"\"\"\" lutte pour l'exclusion des J.O. de 1996 des délégations exclusivement masculines", 
    "La démonstration de force des eurodéputés"), Auteur = c("MILLOT Lorraine", 
    "MILLOT Lorraine", "REMES Erik", "PERON Didier", "AULAGNON MICHELE", 
    NA), Year = structure(c(9131, 9131, 9131, 9131, 9131, 9131
    ), class = "Date"), mois = structure(c(9131, 9131, 9131, 
    9131, 9131, 9131), class = "Date")), row.names = c("1", "2", 
    "3", "4", "5", "6"), class = "data.frame"), i = 2:4, p = c(0L, 
    1L, 2L, 3L, 3L), Dim = c(6L, 4L), Dimnames = list(docs = c("1", 
    "2", "3", "4", "5", "6"), features = c("sexisme", "féminisme", 
    "droitsdesfemmes", "égalitédessexes")), x = c(1, 2, 1), factors = list())

And here is the str:

Formal class 'dfm' [package "quanteda"] with 15 slots
  ..@ settings    : list()
  ..@ weightTf    :List of 3
  .. ..$ scheme: chr "count"
  .. ..$ base  : NULL
  .. ..$ K     : NULL
  ..@ weightDf    :List of 5
  .. ..$ scheme   : chr "unary"
  .. ..$ base     : NULL
  .. ..$ c        : NULL
  .. ..$ smoothing: NULL
  .. ..$ threshold: NULL
  ..@ smooth      : num 0
  ..@ ngrams      : int 1
  ..@ skip        : int 0
  ..@ concatenator: chr "_"
  ..@ version     : int [1:3] 1 5 2
  ..@ docvars     :'data.frame':    16014 obs. of  6 variables:
  .. ..$ Date   : Date[1:16014], format: "1995-01-02" "1995-01-06" "1995-01-06" "1995-01-11" ...
  .. ..$ Journal: chr [1:16014] "Libération" "Libération" "Libération" "Libération" ...
  .. ..$ Titre  : chr [1:16014] "Autriche, Finlande et Suède, trois nouveaux prêts à jouer les bons élèves" "La Suède fait ses débuts dans l'Union européenne en passant par Paris" "1994: Année gay?" "\"\"\"\"Le Péril jeune\"\"\"\" fait table rase des années 70" ...
  .. ..$ Auteur : chr [1:16014] "MILLOT Lorraine" "MILLOT Lorraine" "REMES Erik" "PERON Didier" ...
  .. ..$ Year   : Date[1:16014], format: "1995-01-01" "1995-01-01" "1995-01-01" "1995-01-01" ...
  .. ..$ mois   : Date[1:16014], format: "1995-01-01" "1995-01-01" "1995-01-01" "1995-01-01" ...
  ..@ i           : int [1:14822] 2 10 13 14 18 19 20 24 25 26 ...
  ..@ p           : int [1:5] 0 2935 8389 14690 14822
  ..@ Dim         : int [1:2] 16014 4
  ..@ Dimnames    :List of 2
  .. ..$ docs    : chr [1:16014] "1" "2" "3" "4" ...
  .. ..$ features: chr [1:4] "sexisme" "féminisme" "droitsdesfemmes" "égalitédessexes"
  ..@ x           : num [1:14822] 1 2 1 1 1 1 1 1 1 1 ...
  ..@ factors     : list()

Thank you very much, Regards


Answer 1:


Assuming your dfm is called test, you can just do:

library(quanteda)  # provides convert() and docvars()
library(magrittr)  # provides the %>% pipe
test %>% 
  convert(to = "data.frame") %>% 
  cbind(docvars(test))

Or without the pipe:

cbind(convert(test, to = "data.frame"), docvars(test))

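As far as I know, this is the only way, because convert() does not carry the document variables over to the data frame. It works because both results have one row per document, in the same order.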
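
To get from there to counts by date and by paper, something like the following might work. This is only a sketch on a made-up three-document corpus (the texts, dates and the names txt, corp, df and agg are my own assumptions, not the asker's data); it binds the docvars as above and then sums one feature per Date and Journal with base R's aggregate().

```r
library(quanteda)

# Hypothetical toy data standing in for the real articles
txt <- c(d1 = "sexisme feminisme feminisme",
         d2 = "feminisme",
         d3 = "sexisme sexisme")
corp <- corpus(txt, docvars = data.frame(
  Date    = as.Date(c("1995-01-02", "1995-01-02", "1995-01-06")),
  Journal = c("Libération", "Le Monde", "Libération")
))
test <- dfm(tokens(corp))

# Counts and document variables side by side (rows are in document order)
df <- cbind(convert(test, to = "data.frame"), docvars(test))

# Total occurrences of one word per date and per paper
agg <- aggregate(sexisme ~ Date + Journal, data = df, FUN = sum)
print(agg)
```

The resulting agg has one row per Date/Journal combination, so it can be passed to ggplot with Date on the x axis and Journal mapped to colour.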



Source: https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars
