Is there a way to use for loops within dplyr to reduce the number of str_detect terms needed?

Question

I'm working on a project that classifies about a hundred thousand strings based on their content.

The goal of this code is to identify whether a string matches a search term, assign it to the corresponding bucket, and then save the end result to a CSV. No string contains more than one matching term.

I realise that after a certain point my code gets a little unreadable, mostly because if I have to change one of, say, two hundred str_detect calls with the same format, I then have to find it in my case_when, and so on.

I'm looking for a way to integrate for loops and if conditionals into my function to improve readability and make the str_detect calls easier to modify.

I've tried replacing the case_when/str_detect combination with a tibble that holds all my string classes, string terms and classifications. I then swapped the case_when for a for loop that indexes into the tibble inside str_detect, pulling out one string condition on each iteration.

# Working case_when version

library(dplyr)
library(readr)    # provides read_csv() and write_csv()
library(stringr)

a.str <- "(?i)Apple"
b.str <- "(?i)Banana"
c.str <- "(?i)Corn"

food_set <- read_csv("Food.csv")

# str_detect() already returns TRUE/FALSE, so no comparison with TRUE is needed.
# The column holding the strings is assumed to be named food_set as well.
food_identified <- food_set %>% mutate(
     food.type = case_when(
          str_detect(food_set, a.str) ~ "A",
          str_detect(food_set, b.str) ~ "B",
          str_detect(food_set, c.str) ~ "C"
     )
)

food_classified <- write_csv(food_identified, "Food_Classified.csv")

# Failing for loop version


library(dplyr)
library(stringr)

str_options <- tribble(
~variety.str,      ~String,   ~Classification,
#-----------/-------------/-------------------
"a.str"     , "(i?)Apple" ,               "A",
"b.str"     , "(i?)Banana",               "B",
"c.str"     , "(i?)Corn"  ,               "C"
)

food_set <- read_csv("Food.csv")

food_identified <- food_set %>% mutate(
     for (k in 1:3) {
          if(str_detect(food_set, str_options[k,2]) == TRUE) {
          food.type = str_options[k,3]
     }
     break
     }
)

food_classified <- write_csv(food_identified,"Food_Classified.csv")

The case_when code runs fine - it spits out a table with two columns (food, food_type).

The for loop version doesn't work. It fails with this error: no applicable method for 'type' applied to an object of class "c('tbl_df','tbl','data.frame')".

Does anyone have an idea as to how I might be able to get this working?

Answer 1:

Here's an approach that replaces all the str_detect calls with a single str_extract call. A normal join won't work here because the strings may contain other characters around the term to match. Instead, I combine all the terms into one pattern and extract with it, which gives a new column that can be joined on. The terms contain special characters, so these have to be escaped before the terms are combined. Note that this is safe only because you said each row would have only one matching string, though you should check this (otherwise str_extract keeps only the first match in each string, much as the order of the case_when conditions would matter).

You should also make sure that my interpretation of food_set matches your actual data, or include a dput of a sample.

library(tidyverse)

food_set <- tibble(
  food_set = c("sadgad(i?)Apple", "(i?)Bananaasdgas", "hgjdndg(i?)Cornadfba")
)

str_options <- tribble(
  ~variety.str,      ~String,   ~Classification,
  #-----------/-------------/-------------------
  "a.str"     , "(i?)Apple" ,               "A",
  "b.str"     , "(i?)Banana",               "B",
  "c.str"     , "(i?)Corn"  ,               "C"
)

str_regex <- str_options$String %>%
  str_replace_all("(\\W)", "\\\\\\1") %>%
  str_c(collapse = "|")

food_set %>%
  mutate(to_match = str_extract(food_set, str_regex)) %>%
  left_join(str_options, by = c("to_match" = "String"))
#> # A tibble: 3 x 4
#>   food_set             to_match   variety.str Classification
#>   <chr>                <chr>      <chr>       <chr>         
#> 1 sadgad(i?)Apple      (i?)Apple  a.str       A             
#> 2 (i?)Bananaasdgas     (i?)Banana b.str       B             
#> 3 hgjdndg(i?)Cornadfba (i?)Corn   c.str       C

Created on 2019-04-27 by the reprex package (v0.2.1)
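
The advice above to check that each row contains only one matching string can be sketched as follows. This is a minimal illustration, not part of the original answer: the names patterns, n_matches and ambiguous are hypothetical, and the third sample row is made up to show a string with two terms.

```r
library(dplyr)
library(stringr)

# Sample data; the third row is a made-up string that contains two terms.
food_set <- tibble(
  food_set = c("sadgad(i?)Apple", "(i?)Bananaasdgas", "(i?)Apple and (i?)Corn")
)
patterns <- c("(i?)Apple", "(i?)Banana", "(i?)Corn")

# Count, per row, how many of the terms occur as literal substrings.
# fixed() turns off regex interpretation, so no manual escaping is needed.
n_matches <- Reduce(
  `+`,
  lapply(patterns, function(p) str_detect(food_set$food_set, fixed(p)))
)

# Rows with zero or several matching terms need a closer look.
ambiguous <- food_set %>%
  mutate(n_matches = n_matches) %>%
  filter(n_matches != 1)
ambiguous
```

If ambiguous comes back empty, every string contains exactly one term and the single-pattern extraction is safe for that data.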

Answer 2:

This could also be done with fuzzyjoin. One potential advantage, and also something to watch out for, is that it joins each string to every regex it matches.

library(tidyverse)
library(fuzzyjoin)

# str_options is the lookup table defined in the first answer.
food_set <- tibble(
  food_set = c("sadgad(i?)Apple", "(i?)Bananaasdgas", "hgjdndg(i?)Cornadfba")
)

food_set %>%
  regex_left_join(str_options, by = c("food_set" = "String"))


# A tibble: 3 x 4
  food_set             variety.str String     Classification
  <chr>                <chr>       <chr>      <chr>         
1 sadgad(i?)Apple      a.str       (i?)Apple  A             
2 (i?)Bananaasdgas     b.str       (i?)Banana B             
3 hgjdndg(i?)Cornadfba c.str       (i?)Corn   C
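
To illustrate the point about joining to all matching regexes, here is a small sketch under assumed data. The foods tibble and its strings are hypothetical, and the lookup table is cut down to two rows. A string that matches two patterns should come back as two rows, so the result can have more rows than the input.

```r
library(dplyr)
library(tibble)
library(fuzzyjoin)

# Reduced lookup table; here the String column is interpreted as a regex.
str_options <- tribble(
  ~String,     ~Classification,
  "(i?)Apple", "A",
  "(i?)Corn",  "C"
)

# Hypothetical input: the second string matches both patterns.
foods <- tibble(food_set = c("Apple pie", "Apple and Corn salad"))

joined <- foods %>%
  regex_left_join(str_options, by = c("food_set" = "String"))

# Strings that appear more than once were matched by several patterns.
duplicated_foods <- joined %>%
  count(food_set) %>%
  filter(n > 1)
duplicated_foods
```

Counting rows per input string after the join, as in the last step, is one way to spot these cases before writing the result to a CSV.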

Source: https://stackoverflow.com/questions/55886082/is-there-a-way-to-use-for-loops-within-dplyr-to-reduce-the-number-of-str-detect

Tags

dplyr

stringr