I'm looking to assign some different readability scores to text in R, such as the Flesch-Kincaid.
Does anyone know of a way to segment words into syllables using R?
qdap version 1.1.0 does this task:
library(qdap)
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
syllable_sum(x)
## [1] 1 1 2 2 1 3
Some tools for NLP are available here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The task is non-trivial though. More hints (including an algorithm you could implement) here:
Detecting syllables in a word
gsk3 is correct: if you want a correct solution, it is non-trivial.
For example, you have to watch out for strange things like a silent e at the end of a word (e.g. pane), or know when it's not silent, as in finale.
However, if you just want a quick-and-dirty approximation, this will do it:
> nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", tolower( x ))))
[1] 1 1 2 2 1 3
To understand how the parts work, just strip away the function calls from the outside in, starting with nchar and then gsub, etc., until the expression makes sense to you.
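The same one-liner can be unrolled from the inside out, one call per line. This is only a sketch for reading the expression; the intermediate variable names are made up for illustration, and x is the example vector from the qdap answer.

```r
x <- c("dog", "cat", "pony", "cracker", "shoe", "Popsicle")

lowered   <- tolower(x)                        # lower-case, so a literal "X" cannot already occur
collapsed <- gsub("[aeiouy]+", "X", lowered)   # each run of vowels becomes a single X
only_x    <- gsub("[^X]", "", collapsed)       # drop every character that is not an X
counts    <- nchar(only_x)                     # number of X's = number of vowel groups

collapsed
## [1] "dXg"      "cXt"      "pXnX"     "crXckXr"  "shX"      "pXpsXclX"
counts
## [1] 1 1 2 2 1 3
```

Each vowel group is treated as one syllable, which is why "shoe" counts as 1 and "pony" as 2.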
But my guess is that, pitting R's power against the profusion of exceptions in the English language, you could get a decent answer (maybe 99% right?) parsing normal text without a lot of work. Heck, the simple parser above may get 90%+ right. With a little more work, you could deal with silent e's if you like.
It all depends on your application - whether this is good enough or you need something more accurate.
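As a rough sketch of what "a little more work" could look like, the vowel-group count can be corrected for a trailing silent e. The helper name count_syllables and the two rules it applies are assumptions made for illustration, not a complete treatment of English spelling.

```r
# Hypothetical helper: vowel-group count with a crude silent-e correction.
count_syllables <- function(words) {
  w <- tolower(words)
  groups <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", w)))
  # A final "e" after a consonant is usually silent ("pane"),
  # except in a consonant + "le" ending ("popsicle", "table").
  silent_e <- grepl("[^aeiouy]e$", w) & !grepl("[^aeiouy]le$", w)
  # Never report fewer than one syllable ("the", "me").
  pmax(groups - silent_e, 1L)
}

count_syllables(c("pane", "shoe", "Popsicle", "the", "finale"))
## [1] 1 1 3 1 2
```

Note that "finale" still comes out as 2 rather than 3: the final e there is not silent, and a rule this simple has no way of knowing that. Exceptions like this are why a fully correct solution is non-trivial.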
The koRpus package will help you out immensely, but it's a little difficult to work with.
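library(koRpus)
# Newer koRpus releases may also need library(koRpus.lang.en) for lang = "en".
# `text` is assumed to be a character vector holding the text to score.
tokens <- tokenize(text, format = "obj", lang = "en")
flesch.kincaid(tokens)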
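If koRpus is more than you need, the Flesch-Kincaid grade level can also be approximated by hand in base R. The sketch below uses the standard formula, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59, together with the quick-and-dirty vowel-group syllable count. The function name fk_grade and the naive sentence and word splitting are assumptions for illustration, so expect the result to differ somewhat from what a dedicated package reports.

```r
# Hypothetical helper: approximate Flesch-Kincaid grade level in base R.
fk_grade <- function(txt) {
  # Naive sentence split on ., ! and ?; keep only pieces that contain letters.
  sentences <- unlist(strsplit(txt, "[.!?]+"))
  sentences <- sentences[grepl("[[:alpha:]]", sentences)]
  # Words are runs of letters and apostrophes.
  words <- unlist(regmatches(txt, gregexpr("[[:alpha:]']+", txt)))
  # Crude syllable count: vowel groups, with a minimum of one per word.
  groups <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", tolower(words))))
  syllables <- pmax(groups, 1L)
  0.39 * length(words) / length(sentences) +
    11.8 * sum(syllables) / length(words) - 15.59
}

fk_grade("The cat sat on the mat. The dog ate a cracker.")
```

The syllable step is the weakest link here; a more careful counter can be swapped in without touching the rest of the function.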