How to vectorize R strsplit?

[愿得一人] 2021-02-01 07:29

When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that

1 answer
  • 2021-02-01 07:58

    In general, try to use a vectorized function to begin with. Because strsplit returns a list, it will frequently require some kind of iteration afterwards (which is slower), so avoid it where you can. In your example, you should use nchar instead:

    > nchar(words)
    [1] 1 5 5 3
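
    To make the difference concrete, here is a minimal sketch of why a function built on strsplit does not behave element-wise while nchar does. The words vector is a hypothetical stand-in, chosen only to match the lengths printed above.

    ```r
    # Hypothetical input, chosen to match the lengths printed above.
    words <- c("a", "quick", "brown", "fox")

    # strsplit() accepts a vector, but it returns a list with one
    # character vector per input element, not a flat vector.
    pieces <- strsplit(words, "")
    str(pieces)

    # length() of that list counts the input elements, not the characters.
    n_words <- length(pieces)
    print(n_words)

    # nchar() works element-wise and returns one count per word.
    n_chars <- nchar(words)
    print(n_chars)
    ```

    This is why length(strsplit(x, "")) inside a function needs sapply around it, whereas nchar(x) can be applied to the whole vector directly.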
    

    More generally, take advantage of the fact that strsplit returns a list and use lapply:

    > as.numeric(lapply(strsplit(words,""), length))
    [1] 1 5 5 3
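
    As a variation on the same idea, base R also offers vapply and lengths, which return a vector directly, so the as.numeric wrapper is not needed. This is only a sketch: the words vector is again a hypothetical stand-in, and lengths assumes R 3.2.0 or later.

    ```r
    # Hypothetical input; the question's original vector is not shown here.
    words <- c("a", "quick", "brown", "fox")

    # One character vector per word.
    chars <- strsplit(words, "")

    # vapply() iterates like lapply() but checks that every result
    # is a single integer and returns a plain integer vector.
    counts_vapply <- vapply(chars, length, integer(1))

    # lengths() (base R >= 3.2.0) returns the length of each list element.
    counts_lengths <- lengths(chars)

    print(counts_vapply)
    print(counts_lengths)
    ```

    Both results should match nchar(words) for this input.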
    

    Or else use an l*ply family function from plyr. For instance:

    > laply(strsplit(words,""), length)
    [1] 1 5 5 3
    

    Edit:

    In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

    joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
    joyce <- unlist(strsplit(joyce, " "))
    

    Now that we have all the words, we can do the counts:

    > # original version
    > system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   3.000   4.000   4.666   6.000  69.000 
       user  system elapsed 
       2.65    0.03    2.73 
    > # vectorized function
    > system.time(print(summary(nchar(joyce))))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   3.000   4.000   4.666   6.000  69.000 
       user  system elapsed 
       0.05    0.00    0.04 
    > # with lapply
    > system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   3.000   4.000   4.666   6.000  69.000 
       user  system elapsed 
        0.8     0.0     0.8 
    > # with laply (from plyr)
    > system.time(print(summary(laply(strsplit(joyce,""), length))))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   3.000   4.000   4.666   6.000  69.000 
       user  system elapsed 
      17.20    0.05   17.30
    > # with ldply (from plyr)
    > system.time(print(summary(ldply(strsplit(joyce,""), length))))
           V1        
     Min.   : 0.000  
     1st Qu.: 3.000  
     Median : 4.000  
     Mean   : 4.666  
     3rd Qu.: 6.000  
     Max.   :69.000  
       user  system elapsed 
       7.97    0.00    8.03 
    

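    The vectorized function (nchar, 0.04 s elapsed) and lapply (0.8 s) are considerably faster than the original sapply version (2.73 s), while the plyr functions are slower (17.30 s for laply, 8.03 s for ldply). All solutions return the same answer (as seen by the summary output).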
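
    A rough way to rerun the base R part of this comparison without downloading the text is sketched below. The fake_words vector is synthetic and purely an assumption for illustration, so the absolute timings will not match the ones above and will vary between machines.

    ```r
    # Synthetic stand-in for the Ulysses word list (an assumption, not the original data).
    set.seed(1)
    fake_words <- vapply(
      sample(1:10, 20000, replace = TRUE),
      function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
      character(1)
    )

    # Original approach: one strsplit() call per word via sapply().
    t_sapply <- system.time(
      r1 <- sapply(fake_words, function(x) length(strsplit(x, "")[[1]]),
                   USE.NAMES = FALSE)
    )

    # Vectorized function.
    t_nchar <- system.time(r2 <- nchar(fake_words))

    # A single strsplit() call, then lapply() over the resulting list.
    t_lapply <- system.time(
      r3 <- as.numeric(lapply(strsplit(fake_words, ""), length))
    )

    # All three approaches should agree on the counts.
    stopifnot(all(r1 == r2), all(r2 == r3))

    # Elapsed seconds for each approach.
    print(c(sapply = t_sapply[["elapsed"]],
            nchar  = t_nchar[["elapsed"]],
            lapply = t_lapply[["elapsed"]]))
    ```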

    Apparently the latest version of plyr is faster (these timings were taken with a slightly older version).
