Counting the occurrences of substrings in a column in R with group by

Posted by 邮差的信 on 2019-12-01 08:01:08

We can do this in base R: split the 'String' column by 'village', split each string into substrings at a comma followed by zero or more spaces (",\\s*"), stack the resulting list into a two-column data.frame, and get the frequencies with table

table(stack(lapply(split(df1$String, df1$village), 
            function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
#  values
#ind fd_sec ht_rm NA san   
#  A      1     2  1   0
#  B      1     0  0   0
#  C      0     1  0   1
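
The data frame itself is not shown in this answer. As a rough sketch, the df1 below is a made-up example (an assumption, not the asker's data) chosen so that its counts match the table above; "NA" here is the literal two-character string, not a missing value. The one-liner is unrolled into its intermediate steps:

```r
# Hypothetical sample data (assumed for illustration; not from the original question).
df1 <- data.frame(
  village = c("A", "A", "A", "B", "C"),
  String  = c("ht_rm, fd_sec", "ht_rm", "NA", "fd_sec", "san, ht_rm"),
  stringsAsFactors = FALSE
)

# 1. One character vector of raw strings per village.
by_village <- split(df1$String, df1$village)

# 2. Break every string at a comma plus optional whitespace, flattened per village.
tokens <- lapply(by_village, function(x) unlist(strsplit(x, ",\\s*")))

# 3. Stack into a data.frame with columns 'values' (substring) and 'ind' (village).
long <- stack(tokens)

# 4. Cross-tabulate village (rows) against substring (columns).
tab <- table(long[2:1])
tab
```

The column order follows the sort order of the substrings, which can vary with locale, so it is safer to read individual counts by name, e.g. tab["A", "ht_rm"].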

Or, using the tidyverse: after grouping by 'village', reshape to 'long' format by splitting 'String' with separate_rows, filter out the rows where 'String' is blank, count the frequency of each village/substring pair, and spread the counts to 'wide' format

library(dplyr)
library(tidyr)
df1 %>%
   group_by(village) %>% 
   separate_rows(String, sep=",\\s*") %>%
   filter(nzchar(String)) %>% 
   count(village, String) %>% 
   spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
#  village fd_sec ht_rm  `NA`   san
#* <chr>    <dbl> <dbl> <dbl> <dbl>
#1 A         1.00  2.00  1.00  0   
#2 B         1.00  0     0     0   
#3 C         0     1.00  0     1.00
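
spread() has since been superseded in tidyr by pivot_wider(). A minimal sketch of the same pipeline with the newer function follows, run on a made-up df1 (an assumption, not the asker's data). The scalar values_fill assumes a reasonably recent tidyr, and the group_by() step is left out because count() already groups by 'village' and 'String':

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data (assumed for illustration; not from the original question).
df1 <- data.frame(
  village = c("A", "A", "A", "B", "C"),
  String  = c("ht_rm, fd_sec", "ht_rm", "NA", "fd_sec", "san, ht_rm"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  separate_rows(String, sep = ",\\s*") %>%   # one row per substring
  filter(nzchar(String)) %>%                 # drop empty substrings
  count(village, String) %>%                 # frequency per village/substring
  pivot_wider(names_from = String,           # one column per substring
              values_from = n,
              values_fill = 0L)              # absent combinations become 0
wide
```

Because n is an integer count, the filled zeros are integers too, so the result holds whole numbers rather than the 1.00-style doubles printed above.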

You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast() to tabulate the result.

Example (the data frame is called mydf here rather than df1):

library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
#    village fd_sec ht_rm san NA
# 1:       A      1     2   0  1
# 2:       B      1     0   0  0
# 3:       C      0     1   1  0
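
The two messages in the output above come from dcast() guessing its value column and its aggregation function. A sketch that states both explicitly is given here; mydf is a made-up stand-in (an assumption, not the asker's data), and data.table is attached explicitly in case the installed splitstackshape version does not attach it:

```r
library(splitstackshape)
library(data.table)

# Hypothetical sample data (assumed for illustration; not from the original question).
mydf <- data.frame(
  village = c("A", "A", "A", "B", "C"),
  String  = c("ht_rm, fd_sec", "ht_rm", "NA", "fd_sec", "san, ht_rm"),
  stringsAsFactors = FALSE
)

# Split 'String' at commas into one row per substring (whitespace is stripped by default).
long <- cSplit(mydf, "String", sep = ",", direction = "long")

# Count rows per village/substring; naming value.var and fun.aggregate
# avoids the "Using ... as value column" and "Aggregate function missing" messages.
wide <- dcast(long, village ~ String,
              value.var = "String", fun.aggregate = length)
wide
```

With length as the aggregation function, any column can serve as value.var, since only the number of rows in each cell is counted.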