Base word stemming instead of root word stemming in R

离开以前 2021-02-05 18:10

Is there any way to get the base word instead of the root word when stemming with NLP in R?

Code:

# Loading libraries
library(tm)
library(slam)
        
4 answers
  •  情书的邮戳
    2021-02-05 18:44

    Without a good knowledge of English morphology, you would have to use an existing library rather than create your own stemmer.

    English is full of unexpected morphological surprises that would affect both probabilistic and rule-based models. Some examples are:

    • Interacting affixes, as in inhabitable, where removing the in- prefix along with the -able suffix leaves habit rather than inhabit.
    • Change of the word's category, as in the noun bicycle resulting from stemming the verb bicycling (can affect rules based on categories).
    • Words with negative meanings cannot take negative prefixes (you can have unpretty, but not unugly).
    • Two words as a compound, as in "truck driver" (you would treat them as one word when you stem).

    English also has an issue with I-umlaut, where words like men, geese, feet, best, and a host of other words (all with an 'e'-like sound) cannot be easily stemmed. Stemming foreign, borrowed words, like automaton, may also be an issue.

    Stemming the superlative form is a good example of exceptions:

    best -> good

    eldest -> old
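The exceptions above can be sketched as a small lookup table that is consulted before any rule-based stemming. This is only an illustration in base R: the names `exceptions` and `lookup_base` are hypothetical, and the table holds just the irregular forms mentioned in this answer, whereas a real lemmatizer would ship a far larger list.

```r
# Hypothetical exception table (an assumption for illustration only).
exceptions <- c(best = "good", eldest = "old", men = "man",
                geese = "goose", feet = "foot")

# Return the listed base form for known irregular words and leave
# every other word unchanged, so a stemmer could handle those next.
lookup_base <- function(words) {
  key <- tolower(words)
  hit <- key %in% names(exceptions)
  out <- words
  out[hit] <- unname(exceptions[key[hit]])
  out
}

print(lookup_base(c("best", "eldest", "geese", "bicycling")))
```

Words missing from the table pass through untouched, which is why this approach only complements a stemmer and does not replace one.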

    A lemmatizer would account for such exceptions, but would be slower. You can look at the Porter stemmer rules to get an idea of what you need, or you can just use the SnowballC R package, which implements it.
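As a minimal sketch of the SnowballC route, assuming the package is installed, `wordStem()` can be applied to a character vector. The sample words are taken from the examples in this answer; the variable names `words` and `stems` are only illustrative.

```r
library(SnowballC)

# Sample words drawn from the examples discussed above.
words <- c("bicycling", "inhabitable", "drivers", "best")

# Apply the Porter stemmer to each word.
stems <- wordStem(words, language = "porter")

# Show each word next to its stem for comparison.
print(data.frame(word = words, stem = stems))
```

The stems are produced by suffix-stripping rules and are often not dictionary words. The Porter rules also have no way to map an irregular form such as best to good, which is the gap a lemmatizer or an exception list fills.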
