I would like to count the occurrences of a string in a column ....per group. In this case the string is often a substring in a character column.
I have some data e.g.
ID String village
1 fd_sec, ht_rm, A
2 NA, ht_rm A
3 fd_sec, B
4 san, ht_rm, C
The code that I began with is obviously incorrect, but I am failing on my search to find out I could use the grep function in a column and group by village
impacts <- se %>% group_by(village) %>%
summarise(c_NA = round(sum(sub$en41_1 == "NA")),
c_ht_rm = round(sum(sub$en41_1 == "ht_rm")),
c_san = round(sum(sub$en41_1 == "san")),
c_fd_sec = round(sum(sub$en41_1 == "fd_sec")))
Ideally my output would be:
village fd_sec NA ht_rm san
A 1 1 2
B 1
C 1 1
Thank you in advance
We can do this with base R
by split
ting the 'String' column with 'village', then split the 'String' into substrings by splitting at ,
followed by zero or more spaces (\\s*
), stack
the list
into a two column data.frame
and get the frequency with table
table(stack(lapply(split(df1$String, df1$village),
function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
# values
#ind fd_sec ht_rm NA san
# A 1 2 1 0
# B 1 0 0 0
# C 0 1 0 1
Or using tidyverse
, after grouping by 'village', reshape into 'long' format by splitting the 'String' using separate_rows
, filter
out the rows that have blank values in 'String', count
the frequency and spread
it to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
group_by(village) %>%
separate_rows(String, sep=",\\s*") %>%
filter(nzchar(String)) %>%
count(village, String) %>%
spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
# village fd_sec ht_rm `NA` san
#* <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 1.00 2.00 1.00 0
#2 B 1.00 0 0 0
#3 C 0 1.00 0 1.00
You can also use cSplit()
from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast()
to tabulate the result.
Example:
library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# village fd_sec ht_rm san NA
# 1: A 1 2 0 1
# 2: B 1 0 0 0
# 3: C 0 1 1 0
来源:https://stackoverflow.com/questions/48867843/counting-the-occurrence-of-substrings-in-a-column-in-r-with-group-by