How can I manually set the document id in a corpus?

轮回少年 2021-01-22 06:32

I am creating a Corpus from a dataframe. I pass it as a VectorSource because there is only one column I want to use as the text source. This works fine, however I need

3 answers
  • 2021-01-22 07:02

    I know it's probably late for @user1098798, but there is a way to specify ids directly when creating the corpus. You need to load the data with DataframeSource() and add a mapping for the columns:

    corpus = VCorpus(DataframeSource(df), readerControl = list(reader = readTabular(mapping = list(content = "textColumn", id = "ids"))))
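
The mapping above relies on readTabular(), which, as far as I know, is not available in newer tm releases. A minimal sketch of the same idea, assuming tm 0.7-2 or later, where DataframeSource() expects a data frame whose first two columns are named doc_id and text. The sample data and the name docs are made up for illustration; textColumn and ids follow the answer above.

```r
library(tm)

# Hypothetical sample data using the column names from the answer above
df <- data.frame(
  ids = c("a1", "b2", "c3"),
  textColumn = c("first text", "second text", "third text"),
  stringsAsFactors = FALSE
)

# Assumption: tm >= 0.7-2, where DataframeSource() reads the document id
# from a column named doc_id and the content from a column named text
docs <- data.frame(
  doc_id = df$ids,
  text = df$textColumn,
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))

# Each document's id is taken from the doc_id column
meta(corpus[[1]], "id")
```

Any extra columns in docs would be kept as document-level metadata rather than as text.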
    
  • 2021-01-22 07:08

    Here is a qdap approach that handles this without a loop:

    Use qdap version >= 1.1.0 to convert the dataframe to a Corpus from the start, and the ID tags will be added automatically.

    with(df, as.Corpus(textColumn, ids))
    
    ## <<VCorpus>>
    ## Metadata:  corpus specific: 0, document level (indexed): 3
    ## Content:  documents: 6
    
    
    ## Look around a bit
    meta(with(df, as.Corpus(textColumn, ids)), tag="id")
    inspect(with(df, as.Corpus(textColumn, ids)))
    
  • 2021-01-22 07:14

    Well, one simple but not very elegant way to assign your ids to your documents afterward is the following:

    # seq_along() avoids the 1:0 trap when the corpus is empty
    for (i in seq_along(corpus)) {
       attr(corpus[[i]], "ID") <- df$ids[i]
    }
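
The loop above stores the id as an "ID" attribute, which matches older tm releases. A hedged variant, assuming tm 0.6 or later, where document metadata is read and written through meta() instead. The sample data is made up for illustration.

```r
library(tm)

# Hypothetical sample data using the column names from the answers above
df <- data.frame(
  ids = c("doc_a", "doc_b", "doc_c"),
  textColumn = c("first text", "second text", "third text"),
  stringsAsFactors = FALSE
)

# Build the corpus from the single text column, as in the question
corpus <- VCorpus(VectorSource(df$textColumn))

# Assumption: tm >= 0.6, where the id lives in each document's metadata
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "id") <- df$ids[i]
}

# Read an id back to check the assignment
meta(corpus[[2]], "id")
```

This keeps the VectorSource workflow from the question and only changes how the id is written.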
    