Transmute new columns based on exact match of multiple words in a string

Anonymous (unverified), submitted on 2019-12-03 01:03:01

Question:

I have a data frame:

    df <- data.frame(
      Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
      Dominantspp = c("OM", "OM", "RSS", "CH"),
      Commonspp = c(" ", " ", " ", "OM"),
      Rarespp = c(" ", " ", "SD", "NP"),
      NP = rep("northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM", 4),
      OM = rep("steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM", 4),
      RSS = rep("redside shiner|REDSIDE SHINER|rs|RS|rss|RSS", 4),
      suck = rep("suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS", 4)
    )

I need to use the columns populated with common fish codes/names (NP, OM, RSS, suck) to evaluate the expressions in the first four columns and output a 1/0 based on each of those columns, if the expression is met EXACTLY. The code I have below does not match full words (only partial) and provides incorrect data (see resulting tibble below).

    df %>%
      rowwise() %>%
      transmute_at(vars(NP, OM, RSS, suck),
                   funs(case_when(
                     grepl(., Dominantspp) ~ "1",
                     grepl(., Commonspp) ~ "1",
                     grepl(., Rarespp) ~ "1",
                     grepl(., Otherspp) ~ "1",
                     TRUE ~ "0"))) %>%
      ungroup()

Result: see that in row three, both "suck" and "RSS" receive a "1".

    # A tibble: 4 x 4
         NP    OM   RSS  suck
      <chr> <chr> <chr> <chr>
    1     0     1     0     1
    2     0     1     0     0
    3     0     0     1     1
    4     1     1     1     1

Desired output:

      NP OM RSS suck
    1  0  1   0    1
    2  0  1   0    0
    3  0  0   1    0
    4  1  1   1    0

Answer 1:

The quickest fix that keeps your approach is to wrap each regex in a group and put a word boundary, \\b, at both ends:

    df <- data.frame(
      Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
      Dominantspp = c("OM", "OM", "RSS", "CH"),
      Commonspp = c(" ", " ", " ", "OM"),
      Rarespp = c(" ", " ", "SD", "NP"),
      # Each pattern is grouped and wrapped in \\b so that only whole words match.
      NP = rep("\\b(northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM)\\b", 4),
      OM = rep("\\b(steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM)\\b", 4),
      RSS = rep("\\b(redside shiner|REDSIDE SHINER|rs|RS|rss|RSS)\\b", 4),
      suck = rep("\\b(suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS)\\b", 4),
      stringsAsFactors = FALSE
    )

This makes the regular expressions only match full words, which will make your subsequent solution work.
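To see the effect of the word boundaries in isolation, here is a small base R sketch. It reuses the `suck` pattern from the question; the three test strings and the variable names are chosen only for illustration.

```r
# The question's `suck` pattern, without and with word boundaries.
loose  <- "suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS"
strict <- paste0("\\b(", loose, ")\\b")

# Illustrative inputs: a real sucker code, a shiner code that merely
# contains "SS", and a standalone "SS".
codes <- c("suck SD", "RSS", "SS")

loose_hits  <- grepl(loose, codes)   # partial matches are allowed
strict_hits <- grepl(strict, codes)  # only whole words match

print(data.frame(codes, loose_hits, strict_hits))
```

Without the boundaries, "RSS" counts as a sucker because it contains "SS"; with them, only "suck SD" and the standalone "SS" match.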


Having said that, I don't think this is necessarily the way to approach the problem (rowwise() is rarely recommended today, and this approach won't scale well to many fish codes). I think you'd have an easier time working with this data if you standardized it to a tidy format, with one row per combination of row and code:

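    library(dplyr)     # select(), mutate(), row_number(), %>%
    library(tidyr)     # gather()
    library(tidytext)  # unnest_tokens()

    row_codes <- df %>%
      select(Otherspp:Rarespp) %>%
      mutate(row = row_number()) %>%
      gather(type, codes, -row) %>%
      # Split each cell on spaces; unnest_tokens() lowercases the codes by default.
      unnest_tokens(code, codes, token = "regex", pattern = " ")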

Which would result in:

       row        type code
    1    1 Dominantspp   om
    2    1    Otherspp suck
    3    1    Otherspp   sd
    4    2 Dominantspp   om
    5    2    Otherspp   bt
    6    3 Dominantspp  rss
    7    3    Otherspp   sd
    8    3    Otherspp   rs
    9    3     Rarespp   sd
    10   4   Commonspp   om
    11   4 Dominantspp   ch
    12   4    Otherspp  rss
    13   4     Rarespp   np

At this point, the codes are much easier to work with (you don't need regular expressions anymore). For example, you could inner_join it to a table of the fish codes.
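As a rough sketch of that join: the `fish_codes` lookup table below is hypothetical (it is not part of the original answer) and lists only the lowercase single-word codes from the question's patterns. `row_codes` is typed in from the tokenized table shown above so that the example runs on its own.

```r
library(dplyr)
library(tidyr)

# Tokenized codes, copied from the output shown above.
row_codes <- data.frame(
  row  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  type = c("Dominantspp", "Otherspp", "Otherspp", "Dominantspp", "Otherspp",
           "Dominantspp", "Otherspp", "Otherspp", "Rarespp",
           "Commonspp", "Dominantspp", "Otherspp", "Rarespp"),
  code = c("om", "suck", "sd", "om", "bt", "rss", "sd", "rs", "sd",
           "om", "ch", "rss", "np"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per (species column, code) pair.
fish_codes <- data.frame(
  species = c("NP", "NP", "OM", "OM", "OM", "RSS", "RSS", "suck", "suck", "suck"),
  code    = c("np", "npm", "om", "st", "rb", "rs", "rss", "suck", "su", "ss"),
  stringsAsFactors = FALSE
)

flags <- row_codes %>%
  inner_join(fish_codes, by = "code") %>%   # exact match on whole codes
  distinct(row, species) %>%                # one flag per row and species
  mutate(present = 1L) %>%
  pivot_wider(names_from = species, values_from = present,
              values_fill = list(present = 0L)) %>%
  select(row, NP, OM, RSS, suck) %>%
  arrange(row)

print(flags)
```

Because the join compares whole codes for equality, "rss" can never be mistaken for a sucker code. Note that a row with no recognized code at all would be dropped by `inner_join()`; in this data every row has at least one match.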


