I would like to count the occurrences of a string in a column ....per group. In this case the string is often a substring in a character column.
I have some data e.
You can also use cSplit()
from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast()
to tabulate the result.
Example:
library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# village fd_sec ht_rm san NA
# 1: A 1 2 0 1
# 2: B 1 0 0 0
# 3: C 0 1 1 0
We can do this with base R
by split
ting the 'String' column with 'village', then split the 'String' into substrings by splitting at ,
followed by zero or more spaces (\\s*
), stack
the list
into a two column data.frame
and get the frequency with table
table(stack(lapply(split(df1$String, df1$village),
function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
# values
#ind fd_sec ht_rm NA san
# A 1 2 1 0
# B 1 0 0 0
# C 0 1 0 1
Or using tidyverse
, after grouping by 'village', reshape into 'long' format by splitting the 'String' using separate_rows
, filter
out the rows that have blank values in 'String', count
the frequency and spread
it to 'wide' format
library(dplyr)
library(tidyr)
df1 %>%
group_by(village) %>%
separate_rows(String, sep=",\\s*") %>%
filter(nzchar(String)) %>%
count(village, String) %>%
spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
# village fd_sec ht_rm `NA` san
#* <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 1.00 2.00 1.00 0
#2 B 1.00 0 0 0
#3 C 0 1.00 0 1.00