Remove a verb as a stopword

谎友^ 2021-01-07 07:29

Some words are used sometimes as a verb and sometimes as another part of speech.

Example

A sentence with the meaning of the w

3 Answers
  • 2021-01-07 07:53

    You can install TreeTagger and then call it from R through the koRpus package. Install TreeTagger in a location such as C:\Treetagger.

    I will first show how TreeTagger works, so that you understand what is going on in the actual solution further down in this answer:

    Intro to TreeTagger

    library(koRpus)
    
    your_sentences <- c("I blame myself for what happened", 
                        "For what happened the blame is yours")
    
    text.tagged <- treetag(file="I blame myself for what happened", 
                      format="obj", treetagger="manual", lang="en",
                      TT.options = list(path="C:\\Treetagger", preset="en") )
    text.tagged@TT.res[, 1:2]
    #       token tag    
    #1         I  PP
    #2     blame VVP 
    #3    myself  PP 
    #4       for  IN
    #5      what  WP
    #6  happened VVD 
    

    The sentence has now been analysed, and the "only thing left" is to remove those occurrences of "blame" that are tagged as a verb.

    Solution

    I'll do this sentence by sentence, with a function that first tags the sentence, then checks for "bad words" like "blame" that are also tagged as a verb, and finally removes them from the sentence:

    remove_words <- function(sentence, badword="blame"){
      tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en", 
                             TT.options=list(path="C:\\Treetagger", preset="en"))
      # Check for bad words AND verb:
      cond1 <- (tagged.text@TT.res$token == badword)
      cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
      redflag <- which(cond1 & cond2)
    
      # If no such case, return sentence as is. If so, then remove that word:
      if(length(redflag) == 0) return(sentence)
      else{
        splitsent <- strsplit(sentence, " ")[[1]]
        splitsent <- splitsent[-redflag]
        return(paste0(splitsent, collapse=" "))
      }
    }
    
    lapply(your_sentences, remove_words)
    # [[1]]
    # [1] "I myself for what happened"
    # [[2]]
    # [1] "For what happened the blame is yours"
    
  • 2021-01-07 08:01

    You can do something like this in Python.

    >>> import nltk
    >>> from nltk import word_tokenize
    >>> text = word_tokenize("And now for something completely different")
    >>> nltk.pos_tag(text)
    [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
    ('completely', 'RB'), ('different', 'JJ')]
    

    Then add your own filter, for instance to eliminate verbs.

    Hope this is helpful!
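    As a rough illustration of such a filter, here is a minimal sketch that works on the (token, tag) pairs returned by nltk.pos_tag. It assumes Penn Treebank tags, where verb tags start with "VB"; the name drop_verbs and the hand-written tags in the second call are made up for this example and are not tagger output.

```python
def drop_verbs(tagged):
    """Return the tokens whose tag is not a Penn Treebank verb tag (VB, VBD, VBP, ...)."""
    return [token for token, tag in tagged if not tag.startswith("VB")]


# Tagged pairs copied from the pos_tag output shown in this answer (no verbs in it).
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("completely", "RB"), ("different", "JJ")]
print(drop_verbs(tagged))

# Hand-written tags, assumed for illustration only.
print(drop_verbs([("I", "PRP"), ("blame", "VBP"), ("myself", "PRP")]))
```

    The first call returns all six words unchanged, the second drops "blame" because of its "VBP" tag.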

  • 2021-01-07 08:11

    In Python it can be done like this:

    from nltk import pos_tag
    s1 = "I blame myself for what happened"
    pos_tag(s1.split())
    

    This gives you the words together with their tags.
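    From those tags, removing a word only where it is used as a verb could look roughly like the sketch below. The function name remove_verb_stopword and the tagger parameter are made up for this example; by default it falls back to nltk.pos_tag, which assumes NLTK and its tagger model are installed. The stub tagger is a hypothetical stand-in so the example runs without NLTK, and its tags are not real tagger output.

```python
def remove_verb_stopword(sentence, badword="blame", tagger=None):
    """Drop `badword` from `sentence` only where the tagger marks it as a verb."""
    if tagger is None:
        # Assumption: NLTK and its POS tagger model are installed.
        from nltk import pos_tag
        tagger = pos_tag
    tagged = tagger(sentence.split())
    kept = [token for token, tag in tagged
            if not (token.lower() == badword and tag.startswith("V"))]
    return " ".join(kept)


def stub_tagger(tokens):
    """Hypothetical stand-in tagger: marks 'blame' as a verb only right after 'I'."""
    tags = []
    for i, token in enumerate(tokens):
        is_verb = token == "blame" and i > 0 and tokens[i - 1] == "I"
        tags.append((token, "VBP" if is_verb else "X"))
    return tags


print(remove_verb_stopword("I blame myself for what happened", tagger=stub_tagger))
print(remove_verb_stopword("For what happened the blame is yours", tagger=stub_tagger))
```

    With the stub tagger the first sentence loses "blame" and the second keeps it. With a real tagger the result depends on how each occurrence is tagged, and splitting on whitespace leaves punctuation attached to words.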
