counting the occurrence of substrings in a column in R with group by

前端 未结 2 974
攒了一身酷
攒了一身酷 2021-01-14 03:43

I would like to count the occurrences of a string in a column ....per group. In this case the string is often a substring in a character column.

I have some data e.

相关标签:
2条回答
  • 2021-01-14 04:26

    You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast() to tabulate the result.

    Example:

    library(splitstackshape)
    cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
    # Using 'village' as value column. Use 'value.var' to override
    # Aggregate function missing, defaulting to 'length'
    #    village fd_sec ht_rm san NA
    # 1:       A      1     2   0  1
    # 2:       B      1     0   0  0
    # 3:       C      0     1   1  0
    
    0 讨论(0)
  • 2021-01-14 04:33

    We can do this with base R by splitting the 'String' column with 'village', then split the 'String' into substrings by splitting at , followed by zero or more spaces (\\s*), stack the list into a two column data.frame and get the frequency with table

    table(stack(lapply(split(df1$String, df1$village), 
                function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
    #  values
    #ind fd_sec ht_rm NA san   
    #  A      1     2  1   0
    #  B      1     0  0   0
    #  C      0     1  0   1
    

    Or using tidyverse, after grouping by 'village', reshape into 'long' format by splitting the 'String' using separate_rows, filter out the rows that have blank values in 'String', count the frequency and spread it to 'wide' format

    library(dplyr)
    library(tidyr)
    df1 %>%
       group_by(village) %>% 
       separate_rows(String, sep=",\\s*") %>%
       filter(nzchar(String)) %>% 
       count(village, String) %>% 
       spread(String, n, fill = 0)
    # A tibble: 3 x 5
    # Groups: village [3]
    #  village fd_sec ht_rm  `NA`   san
    #* <chr>    <dbl> <dbl> <dbl> <dbl>
    #1 A         1.00  2.00  1.00  0   
    #2 B         1.00  0     0     0   
    #3 C         0     1.00  0     1.00
    
    0 讨论(0)
提交回复
热议问题