R tm package vcorpus: Error in converting corpus to data frame

轻奢々 2020-12-01 09:51

I am using the tm package to clean up some data using the following code:

mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus,         


        
5 answers
  • 2020-12-01 10:04

    This is an alternative approach I've used in my own text-analytics work. Build a document-term matrix from the corpus, convert it to a plain matrix and then to a data frame, and run one additional line to make the variable names R-friendly.

    dtm <- DocumentTermMatrix(mycorpus)  # as.matrix() expects a document-term matrix, not the corpus itself
    database <- as.data.frame(as.matrix(dtm))

    colnames(database) <- make.names(colnames(database))  # syntactically valid column names

    I'm not sure how (or if) this approach differs from the other answers in terms of output but I find this syntax much more straightforward and simpler to implement. Hope this helps!
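
    To illustrate why the make.names() step matters: terms that are reserved words, start with a digit, or contain punctuation are not valid column names. A minimal base-R sketch, using a made-up terms vector that is not taken from the question:

    ```r
    # Hypothetical terms such as a document-term matrix might contain
    terms <- c("next", "2020", "hello", "e-mail")

    # make.names() appends "." to reserved words, prefixes "X" to names
    # that start with a digit, and replaces invalid characters with "."
    clean_names <- make.names(terms)
    print(clean_names)
    # [1] "next."  "X2020"  "hello"  "e.mail"
    ```

    With valid names, columns can be accessed as database$X2020 without backticks.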

  • 2020-12-01 10:15

    Your corpus is really just a character vector with some extra attributes. So it's best to convert it to character, then you can save that to a data.frame like so:

    library(tm)
    x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
    mycorpus <- Corpus(VectorSource(x))
    mycorpus <- tm_map(mycorpus, removePunctuation)
    
    dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), 
        stringsAsFactors=F)
    

    which returns

                  text
    1        Hello Sir
    2 Tacos On Tuesday
    

    UPDATE: In newer versions of tm, the as.list.SimpleCorpus method seems to have been updated, which really messes with using sapply and lapply. Now I guess you'd have to use

    dataframe <- data.frame(text=sapply(mycorpus, identity), 
        stringsAsFactors=F)
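
    If the code has to keep working across tm versions, one option is to stop relying on how sapply unpacks the corpus and call as.character() on each document instead. The following is only a sketch under assumptions: corpus_to_df is a hypothetical helper name, the documents are assumed to be plain text, and it has not been checked against every tm release.

    ```r
    library(tm)

    # Hypothetical helper: one row per document, text collapsed to one string
    corpus_to_df <- function(corpus) {
      texts <- vapply(
        corpus,
        function(doc) paste(as.character(doc), collapse = " "),
        character(1),
        USE.NAMES = FALSE
      )
      data.frame(text = texts, stringsAsFactors = FALSE)
    }

    x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
    mycorpus <- Corpus(VectorSource(x))
    # Some tm versions emit a warning here for SimpleCorpus objects
    mycorpus <- suppressWarnings(tm_map(mycorpus, removePunctuation))

    dataframe <- corpus_to_df(mycorpus)
    print(dataframe)
    ```

    Because each element is passed through as.character(), the same helper should also accept a corpus built with VCorpus(), where the elements are document objects rather than plain strings.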
    
  • You can convert to data.frame, sort the most frequent words and plot in a wordcloud!

    library(tm)
    library("wordcloud")
    library("RColorBrewer")
    
    x <- c("Hello. Sir!","Tacos? On Tuesday?!?", "Hello")
    mycorpus <- Corpus(VectorSource(x))
    mycorpus <- tm_map(mycorpus, removePunctuation)
    
    dtm <- TermDocumentMatrix(mycorpus)
    m <- as.matrix(dtm)
    v <- sort(rowSums(m),decreasing=TRUE)
    d <- data.frame(word = names(v),freq=v)
    head(d, 10)
    
    #           word freq
    #hello     hello    2
    #sir         sir    1
    #tacos     tacos    1
    #tuesday tuesday    1
    
    #plot in a wordcloud
    set.seed(1234)
    wordcloud(words = d$word, freq = d$freq, min.freq = 1,
              max.words=200, random.order=FALSE, rot.per=0.35, 
              colors=brewer.pal(8, "Dark2"))
    

  • 2020-12-01 10:20

    An object of class Corpus has a content attribute, which is accessible through get:

    library("tm")
    
    x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
    mycorpus <- Corpus(VectorSource(x))
    mycorpus <- tm_map(mycorpus, removePunctuation)
    
    attributes(mycorpus)
    # $names
    # [1] "content" "meta"    "dmeta"  
    # 
    # $class
    # [1] "SimpleCorpus" "Corpus"      
    # 
    
    df <- data.frame(text = get("content", mycorpus))
    
    head(df)
    #               text
    # 1        Hello Sir
    # 2 Tacos On Tuesday
    
  • 2020-12-01 10:22

    The older answer posted by MrFlick works only in previous versions of tm. I was able to fix it by removing "content" from the call.

    dataframe<-data.frame(text=unlist(sapply(mycorpus, `[`)), stringsAsFactors=F)
    