multiple results of one variable when applying tm method “stemCompletion”

长情又很酷 2021-01-14 07:06

I have a corpus containing journal data: 15 observations of 3 variables (ID, title, abstract). Using RStudio, I read in the data from a .csv file (one line per observation

1 answer
  • 2021-01-14 08:00

    The stemCompletion function behaves oddly in tm version 0.6, and it is not obvious how to apply it to a whole corpus there. This answer relies on a workaround published elsewhere: a small wrapper function around stemCompletion.

    First, recreate a CSV file like the one you have:

    dat <- read.csv2( text = 
                      "ID;Text A;Text B
    1;Below is the first title;Innovation and Knowledge Management
    2;And now the second Title;Organizational Performance and Learning are very important
    3;The third title;Knowledge plays an important rule in organizations")
    
    write.csv2(dat, "Test.csv", row.names = FALSE)
    

    Read it in, transform to a corpus, and stem the words:

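    library(tm)         # Corpus, DataframeSource, tm_map, stemDocument
    library(SnowballC)  # stemmer backend used by stemDocument
    
    data <- read.csv2("Test.csv")
    # Make sure the text columns are character, not factor
    data[, 2] <- as.character(data[, 2])
    data[, 3] <- as.character(data[, 3])
    
    corpus <- Corpus(DataframeSource(data))
    corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
    corpus <- tm_map(corpus, stemDocument)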
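
    A note in case you are on a newer tm release than the 0.6 used here: as far as I know, DataframeSource became stricter in later versions and expects a first column named doc_id and a second column named text. Below is a minimal base R sketch of reshaping the test data that way. Pasting title and abstract into one text field is my assumption about what you want, and docs is just a hypothetical name.

    ```r
    # Same test data as above, built inline so the snippet stands alone.
    dat <- data.frame(
      ID = 1:3,
      Text.A = c("Below is the first title",
                 "And now the second Title",
                 "The third title"),
      Text.B = c("Innovation and Knowledge Management",
                 "Organizational Performance and Learning are very important",
                 "Knowledge plays an important rule in organizations"),
      stringsAsFactors = FALSE
    )

    # Assumed layout for newer tm versions: doc_id first, text second.
    docs <- data.frame(
      doc_id = as.character(dat$ID),
      text   = paste(dat$Text.A, dat$Text.B),
      stringsAsFactors = FALSE
    )

    # With tm loaded, this would then be passed on, e.g.:
    # corpus <- VCorpus(DataframeSource(docs))
    print(docs$text[1])
    ```

    On tm 0.6 the original three-column data frame works as shown, so this step is only needed if DataframeSource complains about the column names.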
    

    Have a look to see that it's worked:

    inspect(corpus)
    
    <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
    
    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    1
    Below is the first titl
    Innovat and Knowledg Manag
    
    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    2
    And now the second Titl
    Organiz Perform and Learn are veri import
    
    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    3
    The third titl
    Knowledg play an import rule in organ
    

    Here's the nice workaround to get stemCompletion working:

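    stemCompletion_mod <- function(x, dict = corpuscopy) {
      # Split into tokens, complete each stem against dict, then re-join
      stems <- unlist(strsplit(as.character(x), " "))
      completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
      PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
    }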
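
    To make the three steps inside that wrapper easier to follow (split, complete, re-join), here is a rough base R sketch that runs without tm. The helper complete_stems is hypothetical: it only approximates what I understand stemCompletion to do with type = "shortest", namely prefix matching against the dictionary and keeping the shortest hit. The real function may differ in details.

    ```r
    # Hypothetical stand-in for stemCompletion(type = "shortest"):
    # for each stem, find dictionary words starting with it, keep the shortest.
    complete_stems <- function(stems, dictionary) {
      vapply(stems, function(s) {
        hits <- dictionary[startsWith(dictionary, s)]
        if (length(hits) == 0) NA_character_ else hits[which.min(nchar(hits))]
      }, character(1))
    }

    # Dictionary: the unstemmed words of document 3.
    dictionary <- c("Knowledge", "plays", "an", "important", "rule", "in",
                    "organizations")
    stemmed <- "Knowledg play an import rule in organ"

    stems     <- unlist(strsplit(stemmed, " "))       # 1. split into tokens
    completed <- complete_stems(stems, dictionary)    # 2. complete each token
    result    <- paste(completed, collapse = " ")     # 3. glue back together
    print(result)
    ```

    The wrapper above does the same per document, except that the dictionary is the whole unstemmed corpus copy and the result is wrapped in a PlainTextDocument.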
    

    Inspect the output to see whether the stems were completed:

    lapply(corpus, stemCompletion_mod)
    
    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    1 Below is the first title Innovation and Knowledge Management
    
    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    2 And now the second Title Organizational Performance and Learning are NA important
    
    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    3 The third title Knowledge plays an important rule in organizations
    

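    Success, apart from one stem: "veri" in document 2 came back as NA. Completion works by matching the stem as a prefix of the dictionary words, and "very" does not start with "veri", so nothing was found for it.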
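
    One way to avoid the NA that shows up in document 2 is to fall back to the stem itself whenever no completion is found. Below is a minimal sketch in base R. It assumes that stemCompletion returns a character vector named by the input stems, with NA where nothing matched; the vector here is typed in by hand to mimic that shape rather than produced by tm.

    ```r
    # Hand-written example mimicking a completion result for three stems:
    # names are the stems, values are the completions (NA = no match).
    completed <- c(are = "are", veri = NA, import = "important")

    # Keep the completion where there is one, otherwise keep the stem.
    filled <- ifelse(is.na(completed), names(completed), completed)
    print(paste(filled, collapse = " "))
    ```

    Inside stemCompletion_mod the same ifelse line could be applied to the completed vector before it is pasted back together. You would then get "veri" instead of NA, which is still a stem but at least keeps the token.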
