Update/Replace Values in Dataframe with Tidyverse Join

说谎 2021-01-01 01:29

What is the most efficient way to update/replace NAs in the main dataset with the (correct) values from a lookup table? This is such a common operation! Similar questions do not seem

5 Answers
  • 2021-01-01 01:49

    Here is a single-line solution with rows_update() (available in dplyr 1.0.0 and later):

    df1 %>% 
      rows_update(lookup_df, by = "state_abbrev")
    

    Demo:

    library(dplyr)
    
    ### Main Dataframe ###
    df1 <- tibble(
      state_abbrev = state.abb[1:10],
      state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
      value = sample(500:1200, 10, replace=TRUE)
    )
    
    ### Lookup Dataframe ###
    lookup_df <- tibble(
      state_abbrev = state.abb[6:8],
      state_name = state.name[6:8]
    )
    
    df1 %>% 
      rows_update(lookup_df, by = "state_abbrev")
    #> # A tibble: 10 x 3
    #>    state_abbrev state_name  value
    #>    <chr>        <chr>       <int>
    #>  1 AL           Alabama       532
    #>  2 AK           Alaska        640
    #>  3 AZ           Arizona       521
    #>  4 AR           Arkansas      523
    #>  5 CA           California    970
    #>  6 CO           Colorado      695
    #>  7 CT           Connecticut   504
    #>  8 DE           Delaware     1088
    #>  9 FL           Florida       979
    #> 10 GA           Georgia      1059
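
Note that rows_update() overwrites every non-key column for each matching key, whether or not the existing value is NA. If only the missing values should be filled, dplyr's rows_patch() may be the closer fit. The sketch below uses a small made-up data set (the "Conn." entry is an assumption for illustration, not part of the question) to show the difference:

```r
library(dplyr)

# Illustrative data: "CO" is missing its name, "CT" holds a non-NA value
df1 <- tibble(
  state_abbrev = c("AL", "CO", "CT"),
  state_name   = c("Alabama", NA, "Conn."),
  value        = c(1L, 2L, 3L)
)
lookup_df <- tibble(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

# rows_patch() only replaces NA values in df1
patched <- rows_patch(df1, lookup_df, by = "state_abbrev")

# rows_update() replaces the value for every matching key
updated <- rows_update(df1, lookup_df, by = "state_abbrev")

patched$state_name
updated$state_name
```

Both functions expect the keys in lookup_df to be unique and, by default, to exist in df1.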
    
  • 2021-01-01 01:49

    If the abbreviation column is complete and the lookup table is complete, could you just drop the state_name column and then join?

    left_join(df1 %>% select(-state_name), lookup_df, by = 'state_abbrev') %>% 
      select(state_abbrev, state_name, value)
    

    Another option could be to use match and if_else in a mutate call, using the built-in state name and abbreviation vectors:

    df1 %>% 
      mutate(state_name = if_else(is.na(state_name), state.name[match(state_abbrev,state.abb)], state_name))
    

    Both give the same output:

    # A tibble: 10 x 3
       state_abbrev state_name  value
       <chr>        <chr>       <int>
     1 AL           Alabama       525
     2 AK           Alaska        719
     3 AZ           Arizona      1186
     4 AR           Arkansas     1051
     5 CA           California    888
     6 CO           Colorado      615
     7 CT           Connecticut   578
     8 DE           Delaware      894
     9 FL           Florida       536
    10 GA           Georgia       599       
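
The drop-and-join variant only reproduces this output when the lookup table covers every abbreviation. With a partial lookup table, the names that were already present would be lost. The match() idea also works against an arbitrary lookup table rather than the built-in vectors. A minimal sketch, assuming the lookup keys are unique and using a small made-up data set:

```r
library(dplyr)

# Illustrative data, not the data from the question
df1 <- tibble(
  state_abbrev = c("AL", "CO", "CT"),
  state_name   = c("Alabama", NA, NA)
)
lookup_df <- tibble(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

filled <- df1 %>%
  mutate(state_name = coalesce(
    state_name,
    # match() returns NA for abbreviations absent from the lookup table,
    # so those rows simply keep their existing value
    lookup_df$state_name[match(state_abbrev, lookup_df$state_abbrev)]
  ))

filled
```

Here coalesce() takes the place of if_else(is.na(...), ...). Both keep the existing value whenever it is not NA.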
    
  • 2021-01-01 01:50

    There is currently no one-shot function for coalescing more than one column after a join (a single column can be handled with a lookup-table approach inside ifelse(is.na(value), ..., value)), though there has been discussion of how such behavior might be implemented. For now, you can build it manually. If you have a lot of columns, you can coalesce programmatically, or even wrap the logic in a function.

    library(tidyverse)
    
    df1 <- tibble(
        state_abbrev = state.abb[1:10],
        state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
        value = sample(500:1200, 10, replace=TRUE)
    )
    
    lookup_df <- tibble(
        state_abbrev = state.abb[6:8],
        state_name = state.name[6:8]
    )
    
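    df1 %>% 
        full_join(lookup_df, by = 'state_abbrev') %>% 
        # '\\.x$' matches only the literal ".x" suffix added by the join;
        # a bare '.x' is a regex that matches any character followed by "x"
        bind_cols(map_dfc(grep('\\.x$', names(.), value = TRUE), function(x){
            set_names(
                list(coalesce(.[[x]], .[[sub('\\.x$', '.y', x)]])), 
                sub('\\.x$', '', x)
            )
        })) %>% 
        select(union(names(df1), names(lookup_df)))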
    #> # A tibble: 10 x 3
    #>    state_abbrev state_name  value
    #>    <chr>        <chr>       <int>
    #>  1 AL           Alabama       877
    #>  2 AK           Alaska       1048
    #>  3 AZ           Arizona       973
    #>  4 AR           Arkansas      860
    #>  5 CA           California    938
    #>  6 CO           Colorado      639
    #>  7 CT           Connecticut   547
    #>  8 DE           Delaware      672
    #>  9 FL           Florida       667
    #> 10 GA           Georgia      1142
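
The "put it in a function" idea could look roughly like the sketch below. coalesce_join() is a hypothetical helper name, not a dplyr function. It assumes the lookup keys are unique, that the shared columns have compatible types, and that no existing column already ends in ".x" or ".y".

```r
library(dplyr)

# Hypothetical helper: left-join y onto x and, for every non-key column the
# two tables share, keep x's value unless it is NA.
coalesce_join <- function(x, y, by) {
  shared <- setdiff(intersect(names(x), names(y)), by)
  joined <- left_join(x, y, by = by, suffix = c(".x", ".y"))
  for (col in shared) {
    joined[[col]] <- coalesce(joined[[paste0(col, ".x")]],
                              joined[[paste0(col, ".y")]])
  }
  # Drop the suffixed columns, then restore x's original column order
  drop <- c(paste0(shared, ".x"), paste0(shared, ".y"))
  joined <- joined[setdiff(names(joined), drop)]
  select(joined, all_of(names(x)), everything())
}

# Illustrative data, not the data from the question
df1 <- tibble(
  state_abbrev = c("AL", "CO", "CT"),
  state_name   = c("Alabama", NA, NA),
  value        = c(10L, 20L, 30L)
)
lookup_df <- tibble(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

result <- coalesce_join(df1, lookup_df, by = "state_abbrev")
result
```

A left join is used here so that keys present only in the lookup table do not add rows. Swapping in full_join() would mirror the pipeline above more closely.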
    
  • 2021-01-01 01:56

    To preserve the column order:

    df1 %>% 
      left_join(lookup_df, by = "state_abbrev") %>% 
      mutate(state_name.x = coalesce(state_name.x, state_name.y)) %>% 
      rename(state_name = state_name.x) %>%
      select(-state_name.y)
    
  • 2021-01-01 02:08

    Picking up Alistaire's and Nettle's suggestions and turning them into a working solution:

    df1 %>% 
      left_join(lookup_df, by = "state_abbrev") %>% 
      mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
      select(-state_name.x, -state_name.y)
    
    # A tibble: 10 x 3
       state_abbrev value state_name 
       <chr>        <int> <chr>      
     1 AL             671 Alabama    
     2 AK             501 Alaska     
     3 AZ            1030 Arizona    
     4 AR             694 Arkansas   
     5 CA             881 California 
     6 CO             821 Colorado   
     7 CT             742 Connecticut
     8 DE             665 Delaware   
     9 FL             948 Florida    
    10 GA             790 Georgia
    

    The OP has stated a preference for a "tidyverse" solution. However, update joins are already available in the data.table package:

    library(data.table)
    setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
    df1
    
        state_abbrev  state_name value
     1:           AL     Alabama  1103
     2:           AK      Alaska  1036
     3:           AZ     Arizona   811
     4:           AR    Arkansas   604
     5:           CA  California   868
     6:           CO    Colorado  1129
     7:           CT Connecticut   819
     8:           DE    Delaware  1194
     9:           FL     Florida   888
    10:           GA     Georgia   501
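
The update join above assigns i.state_name to every matched row, so a non-NA name in df1 would also be overwritten. If only NA values should be replaced, one possible variant combines the join with fcoalesce(), assuming data.table 1.12.4 or later. The data below is made up for illustration:

```r
library(data.table)

# Illustrative data: "CT" already holds a non-NA value that should be kept
dt <- data.table(
  state_abbrev = c("AL", "CO", "CT"),
  state_name   = c("Alabama", NA, "Conn.")
)
lk <- data.table(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

# x. refers to columns of dt, i. to columns of lk; fcoalesce() keeps the
# existing value and falls back to the lookup value only where it is NA
dt[lk, on = "state_abbrev",
   state_name := fcoalesce(x.state_name, i.state_name)]

dt
```

As with the original update join, dt is modified by reference.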
    

    Benchmark

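    library(bench)
    library(dplyr)       # %>%, left_join(), coalesce(), as_tibble()
    library(data.table)  # data.table(), copy(), setDT()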
    bm <- press(
      na_share = c(0.1, 0.5, 0.9),
      n_row = length(state.abb) * 2 * c(1, 100, 10000),
      {
        n_na <- na_share * length(state.abb)
        set.seed(1)
        na_idx <- sample(length(state.abb), n_na)
        tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
        lookup_df <-tmp[na_idx] 
        tmp[na_idx, state_name := NA]
        df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
        mark(
          dplyr = {
            df1 <- copy(df0)
            df1 <- df1 %>% 
              left_join(lookup_df, by = "state_abbrev") %>% 
              mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
              select(-state_name.x, -state_name.y)
            df1
          },
          upd_join = {
            df1 <- copy(df0)
            setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
            df1
          }
        )
      }
    )
    ggplot2::autoplot(bm)
    

    data.table's update join is always faster (note the log time scale).

    As the update join modifies the data object, a fresh copy is used for each benchmark run.
