Counting the occurrences of substrings in a column in R with group by

前端未结


I would like to count the occurrences of a string in a column, per group. In this case the string is often a substring within a character column.

I have some data e.


2 Answers

不思量自难忘°

2021-01-14 04:26

You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast() to tabulate the result.

Example:

library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
#    village fd_sec ht_rm san NA
# 1:       A      1     2   0  1
# 2:       B      1     0   0  0
# 3:       C      0     1   1  0


北恋

2021-01-14 04:33

We can do this in base R: split the 'String' column by 'village', split each entry into substrings at a comma followed by zero or more spaces (",\\s*"), stack the resulting list into a two-column data.frame, and get the frequencies with table.

table(stack(lapply(split(df1$String, df1$village), 
            function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
#  values
#ind fd_sec ht_rm NA san   
#  A      1     2  1   0
#  B      1     0  0   0
#  C      0     1  0   1
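The one-liner above can be unpacked into its individual steps. The following is only a sketch: the question's data is not shown here, so df1 below is a hypothetical stand-in with the same two columns ('village' and 'String'), and the intermediate names (by_village, tokens, long, counts) are made up for illustration.

```r
# Hypothetical sample data (an assumption; not the asker's actual data).
df1 <- data.frame(
  village = c("A", "A", "B", "C"),
  String  = c("fd_sec, ht_rm", "ht_rm", "fd_sec", "san,ht_rm"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of 'String' entries per village.
by_village <- split(df1$String, df1$village)

# Step 2: break every entry at a comma followed by optional whitespace,
# then flatten to a single vector of substrings per village.
tokens <- lapply(by_village, function(x) unlist(strsplit(x, ",\\s*")))

# Step 3: stack the named list into a data.frame with columns
# 'values' (the substring) and 'ind' (the village).
long <- stack(tokens)

# Step 4: cross-tabulate village against substring.
counts <- table(village = long$ind, substring = long$values)
print(counts)
```

Keeping the intermediate objects around makes it easy to inspect where an unexpected category (such as a stray "NA" column) comes from.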

Or, using tidyverse: group by 'village', reshape into 'long' format by splitting 'String' with separate_rows, filter out the rows that have a blank 'String', count the frequencies, and spread the result into 'wide' format.

library(dplyr)
library(tidyr)
df1 %>%
   group_by(village) %>% 
   separate_rows(String, sep=",\\s*") %>%
   filter(nzchar(String)) %>% 
   count(village, String) %>% 
   spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
#  village fd_sec ht_rm  `NA`   san
#* <chr>    <dbl> <dbl> <dbl> <dbl>
#1 A         1.00  2.00  1.00  0   
#2 B         1.00  0     0     0   
#3 C         0     1.00  0     1.00
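In current tidyr, spread() is superseded by pivot_wider(). A possible equivalent is sketched below, assuming tidyr 1.1.0 or later (for a scalar values_fill) and a hypothetical df1 standing in for the question's data, which is not shown here. The explicit group_by() is left out because count(village, String) already counts per village.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data (an assumption; not the asker's actual data).
df1 <- data.frame(
  village = c("A", "A", "B", "C"),
  String  = c("fd_sec, ht_rm", "ht_rm", "fd_sec", "san,ht_rm"),
  stringsAsFactors = FALSE
)

counts <- df1 %>%
  # One row per substring: split at a comma plus optional whitespace.
  separate_rows(String, sep = ",\\s*") %>%
  # Drop missing and empty substrings.
  filter(!is.na(String), nzchar(String)) %>%
  # Frequency of each substring within each village (column 'n').
  count(village, String) %>%
  # One column per substring; absent combinations become 0.
  pivot_wider(names_from = String, values_from = n, values_fill = 0L)

print(counts)
```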
