When creating functions that use strsplit(), vector inputs do not behave as desired, and sapply() needs to be used. This is due to the list output that strsplit() returns.
In general, you should try to use a vectorized function to begin with. Using strsplit() will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar() instead:
> nchar(words)
[1] 1 5 5 3
More generally, take advantage of the fact that strsplit() returns a list and use lapply():
> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
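If you would rather stay in base R, two type-stable alternatives to the as.numeric(lapply(...)) idiom are vapply() and lengths(). The following is only a minimal sketch: the words vector is hypothetical (the original one is not shown here) and was chosen to reproduce the counts above, and lengths() needs R 3.2.0 or later.

```r
# Hypothetical input, chosen to match the character counts shown above.
words <- c("a", "quick", "brown", "fox")

# strsplit() returns a list with one character vector per input string.
pieces <- strsplit(words, "")

# vapply() iterates like sapply(), but checks that every result
# is a single integer and always returns an integer vector.
counts_vapply <- vapply(pieces, length, integer(1))

# lengths() returns the length of each list element directly.
counts_lengths <- lengths(pieces)

print(counts_vapply)   # [1] 1 5 5 3
print(counts_lengths)  # [1] 1 5 5 3
```

Both forms split the whole vector once and then measure each piece, so they avoid calling strsplit() separately for every element.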
Or else use a function from the l*ply family in plyr (load the package first with library(plyr)). For instance:
> laply(strsplit(words,""), length)
[1] 1 5 5 3
Edit:
In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))
Now that I have all the words, we can do our counts:
> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   2.65    0.03    2.73
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   0.05    0.00    0.04
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
    0.8     0.0     0.8
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1
 Min.   : 0.000
 1st Qu.: 3.000
 Median : 4.000
 Mean   : 4.666
 3rd Qu.: 6.000
 Max.   :69.000
   user  system elapsed
   7.97    0.00    8.03
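The vectorized nchar() (0.04 s elapsed) and the lapply() version (0.8 s) are considerably faster than the original sapply() version (2.73 s), while the plyr functions are slower (17.30 s for laply, 8.03 s for ldply). All solutions return the same answer (as seen by the summary output).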
Apparently the latest version of plyr is faster (this is using a slightly older version).
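To rerun the base R part of this comparison without downloading the text, something like the sketch below should work. It is only an illustration: fake_words is a synthetic stand-in for the Ulysses word vector (random lowercase strings, an assumption of mine), so the timings will not match the ones above and will vary by machine.

```r
# Synthetic stand-in for the word vector (an assumption, not the Ulysses text):
# 10,000 random lowercase "words" of 1 to 10 characters.
set.seed(1)
fake_words <- vapply(
  sample(1:10, 10000, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)

# Original approach: split each word separately inside sapply().
t_sapply <- system.time(
  r_sapply <- sapply(fake_words, function(x) length(strsplit(x, "")[[1]]),
                     USE.NAMES = FALSE)
)

# Vectorized approach.
t_nchar <- system.time(r_nchar <- nchar(fake_words))

# Split the whole vector once, then iterate over the resulting list.
t_lapply <- system.time(
  r_lapply <- as.numeric(lapply(strsplit(fake_words, ""), length))
)

# All three should agree; all.equal() ignores the integer/double difference.
stopifnot(isTRUE(all.equal(r_sapply, r_nchar)),
          isTRUE(all.equal(r_nchar, r_lapply)))

# Elapsed seconds for each approach.
print(c(sapply = t_sapply[["elapsed"]],
        nchar  = t_nchar[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))
```

The stopifnot() call makes the "same answer" check explicit instead of relying on comparing summary output by eye.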