counting the occurrence of substrings in a column in R with group by

大憨熊 提交于 2019-12-01 06:28:33

问题


I would like to count the occurrences of a string in a column ....per group. In this case the string is often a substring in a character column.

I have some data e.g.

ID   String              village
1    fd_sec, ht_rm,      A
2    NA, ht_rm           A
3    fd_sec,             B
4    san, ht_rm,         C

The code that I began with is obviously incorrect, but I am failing on my search to find out I could use the grep function in a column and group by village

impacts <- se %>%  group_by(village) %>%
summarise(c_NA = round(sum(sub$en41_1 ==  "NA")),
          c_ht_rm = round(sum(sub$en41_1 ==  "ht_rm")),
          c_san = round(sum(sub$en41_1 ==  "san")),
          c_fd_sec = round(sum(sub$en41_1 ==  "fd_sec")))

Ideally my output would be:

village  fd_sec  NA  ht_rm  san
A        1       1   2 
B        1
C                    1      1

Thank you in advance


回答1:


We can do this with base R by splitting the 'String' column with 'village', then split the 'String' into substrings by splitting at , followed by zero or more spaces (\\s*), stack the list into a two column data.frame and get the frequency with table

table(stack(lapply(split(df1$String, df1$village), 
            function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
#  values
#ind fd_sec ht_rm NA san   
#  A      1     2  1   0
#  B      1     0  0   0
#  C      0     1  0   1

Or using tidyverse, after grouping by 'village', reshape into 'long' format by splitting the 'String' using separate_rows, filter out the rows that have blank values in 'String', count the frequency and spread it to 'wide' format

library(dplyr)
library(tidyr)
df1 %>%
   group_by(village) %>% 
   separate_rows(String, sep=",\\s*") %>%
   filter(nzchar(String)) %>% 
   count(village, String) %>% 
   spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
#  village fd_sec ht_rm  `NA`   san
#* <chr>    <dbl> <dbl> <dbl> <dbl>
#1 A         1.00  2.00  1.00  0   
#2 B         1.00  0     0     0   
#3 C         0     1.00  0     1.00



回答2:


You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast() to tabulate the result.

Example:

library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
#    village fd_sec ht_rm san NA
# 1:       A      1     2   0  1
# 2:       B      1     0   0  0
# 3:       C      0     1   1  0


来源:https://stackoverflow.com/questions/48867843/counting-the-occurrence-of-substrings-in-a-column-in-r-with-group-by

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!