stemDocument in tm package not working on past tense words

生来不讨喜 2020-12-21 18:10

I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to

1 Answer
  • 2020-12-21 18:22

    If a package shipped a data set of irregular English verbs, this task would be easy. I do not know of any such package, so I built my own database by scraping a website. I am not sure whether this website covers all irregular verbs; if necessary, look for a better source to build your database from. Once you have the database, you can work on the task.
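To make the target shape of such a database concrete, here is a minimal sketch of a hand-written stand-in. The name `mydic_small` and its four rows are assumptions for illustration only; a usable table needs far more verbs. The column names mirror the ones used in the scraped table (`Infinitive`, `Past`, `PP`).

```r
# Hypothetical hand-written stand-in for the scraped table of irregular verbs.
# One row per verb: infinitive, simple past, past participle.
mydic_small <- data.frame(
  Infinitive = c("say",  "make", "draw",  "do"),
  Past       = c("said", "made", "drew",  "did"),
  PP         = c("said", "made", "drawn", "done"),
  stringsAsFactors = FALSE
)

# Look up the infinitive of a single past form
made_inf <- mydic_small$Infinitive[mydic_small$Past == "made"]
made_inf
```

A table like this is enough to test the replacement logic offline before relying on a scraped source.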

    First, I used stemDocument() to clean up present forms ending in -s. Then I collected the past forms found in the stemmed words (past) and their infinitive forms (inf1), identified the order of the past forms in temp (ind), and located their positions in temp (position). Finally, I replaced the past forms with their infinitive forms. I repeated the same procedure for past participles.

    library(tm)
    library(rvest)
    library(dplyr)
    library(splitstackshape)
    
    
    ### Create a database
    x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")
    
    x %>%
    html_table(header = TRUE) %>%
    bind_rows %>%
    rename(Past = `Simple Past`, PP = `Past Participle`) %>%
    filter(!Infinitive %in% LETTERS) %>%
    cSplit(splitCols = c("Past", "PP"),
           sep = " / ", direction = "long") %>%
    filter(complete.cases(.)) %>%
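    # mutate_each() and funs() are deprecated in current dplyr; across() is the replacement
    mutate(across(everything(),
                  ~ gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
                         replacement = "",
                         x = .x))) -> mydic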
    
    ### Work on the task
    
    words <- c("said", "drawn", "say", "says", "make", "made", "done")
    
    ### Stem regular inflections first (e.g., "says" becomes "say")
    temp <- stemDocument(words)
    
    ### Replace past forms with their infinitive forms
    ### Collect the past forms that appear in temp
    past <- mydic$Past[which(mydic$Past %in% temp)]
    
    ### Collect infinitive forms of past forms
    inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]
    
    ### Identify the order of past forms in temp
    ind <- match(temp, past)
    ind <- ind[!is.na(ind)]
    
    ### Where are the past forms in temp?
    position <- which(temp %in% past)
    
    temp[position] <- inf1[ind]
    
    ### Check
    temp
    #[1] "say"   "drawn" "say"   "say"   "make"  "make"  "done" 
    
    
    ### Replace past participles with their infinitive forms (same procedure as for past forms)
    
    pp <- mydic$PP[which(mydic$PP %in% temp)]
    inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
    ind <- match(temp, pp)
    ind <- ind[!is.na(ind)]
    position <- which(temp %in% pp)
    temp[position] <- inf2[ind]
    
    ### Check
    temp
    #[1] "say"  "draw" "say"  "say"  "make" "make" "do" 
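The two replacement passes above can also be folded into one helper. The following is only a sketch: `to_infinitive` and `small_dic` are hypothetical names, and the tiny dictionary stands in for the scraped `mydic` so that the example runs without a network connection or the tm package. It assumes the input has already been passed through stemDocument().

```r
# Hypothetical helper: map irregular past and past-participle forms to infinitives.
to_infinitive <- function(words, dic) {
  # Stack both irregular columns into one lookup vector, with the matching
  # infinitive at the same position
  forms <- c(dic$Past, dic$PP)
  bases <- c(dic$Infinitive, dic$Infinitive)
  # match() returns NA for words that are not irregular forms
  idx <- match(words, forms)
  # Keep unmatched words as they are, replace matched ones
  ifelse(is.na(idx), words, bases[idx])
}

# Hypothetical stand-in for the scraped dictionary
small_dic <- data.frame(
  Infinitive = c("say",  "make", "draw",  "do"),
  Past       = c("said", "made", "drew",  "did"),
  PP         = c("said", "made", "drawn", "done"),
  stringsAsFactors = FALSE
)

# The vector that stemDocument(words) produced in the answer above
stemmed <- c("said", "drawn", "say", "say", "make", "made", "done")
result <- to_infinitive(stemmed, small_dic)
result
```

Note that match() takes the first hit, so a form shared by several verbs would map to whichever infinitive comes first in the table.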
    