dplyr: vectorisation of substr

北荒 2021-01-21 23:13

Referring to the question "substr in dplyr %>% mutate" and to @akrun's answer, why do the two created columns give the same answer?

df <- data_frame(t = '1


        
1 answer
  •  挽巷 (original poster)
    2021-01-21 23:42

    The difference lies in how the two functions vectorize over their arguments.

    substr("1234567890ABCDEFG", df$a, df$a+df$b)
    #[1] "1234567"
    substring("1234567890ABCDEFG", df$a, df$a+df$b)
    #[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"
    

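    `substr` returns a result as long as its first argument `x`, so with a single string it returns a single value and uses only the first elements of `start` and `stop`. `substring` returns a vector as long as its longest argument, here one element per row of `df`. Because `substr` produces only one value, `mutate` recycles it across all rows. However, if `x` itself contains multiple values, e.g.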

    substr(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
    #[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"
    substring(rep("1234567890ABCDEFG", nrow(df)), df$a, df$a+df$b)
    #[1] "1234567"     "23456789"    "34567890A"   "4567890ABC"  "567890ABCDE"
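
The length rule behind these results can be checked directly in base R. This is only a minimal sketch: the vectors `a` and `b` are assumed to match the columns printed in the table at the end of this answer (`a = 1:5`, `b = 6:10`), since the original definition of `df` is truncated in the question.

```r
# Assumed values, read off the table printed in this answer
a <- 1:5
b <- 6:10
x <- "1234567890ABCDEFG"

# substr(): the result has the same length as x,
# so only the first start/stop pair is used
s1 <- substr(x, a, a + b)
length(s1)

# substring(): the result is as long as the longest argument
s2 <- substring(x, a, a + b)
length(s2)

# Once x is as long as start and stop, both functions agree
same <- identical(substr(rep(x, length(a)), a, a + b), s2)
same
```

Here `length(s1)` should be 1 and `length(s2)` should be 5, which is why only the `substr` column ends up recycled inside `mutate`.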
    

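    With `x` repeated to match the number of rows, the two outputs are the same. This is what happens in the OP's example: the `x` passed to `substr` is the column `t`, which has the same length as `start` and `stop`, so both columns agree. We can reproduce the first (differing) output inside `mutate` by passing a constant string instead: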

    library(dplyr)

    df %>%
        mutate(u = substr("1234567890ABCDEFG", a, a+b),     # single value, recycled to every row
               v = substring("1234567890ABCDEFG", a, a+b))  # one value per row
    #                 t     a     b       u           v
    #              (chr) (int) (int)   (chr)       (chr)
    #1 1234567890ABCDEFG     1     6 1234567     1234567
    #2 1234567890ABCDEFG     2     7 1234567    23456789
    #3 1234567890ABCDEFG     3     8 1234567   34567890A
    #4 1234567890ABCDEFG     4     9 1234567  4567890ABC
    #5 1234567890ABCDEFG     5    10 1234567 567890ABCDE
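
For anyone who wants to run the comparison end to end, here is a self-contained sketch. It assumes `df` has the columns shown in the printed table (`t`, `a = 1:5`, `b = 6:10`), because the original definition is truncated in the question, and it uses `tibble()` in place of the older `data_frame()`. The column name `u_col` is made up for illustration.

```r
library(dplyr)

# Assumed reconstruction of df from the printed output above
df <- tibble(t = "1234567890ABCDEFG", a = 1:5, b = 6:10)

res <- df %>%
  mutate(
    u     = substr("1234567890ABCDEFG", a, a + b),     # constant x: one value, recycled
    v     = substring("1234567890ABCDEFG", a, a + b),  # one value per row
    u_col = substr(t, a, a + b)                        # x is a column: one value per row
  )

# Passing the column t to substr() matches substring()
identical(res$v, res$u_col)
```

In this sketch, passing the column `t` rather than a string literal is what gives `substr` an `x` of the same length as `a` and `b`, so `u_col` matches `v` while `u` repeats the first value.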
    
