Fill in missing values using the other data?

Question

A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),  
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA))

B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"), 
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"))

Expected:

A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),  
                Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", 
                         "JAMES RIVER", "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))

B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"), 
                Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", 
                           "RICE MIDSTREAM", "RICE MIDSTREAM"))

I need to fill in Item_B using the Item_B value of other rows that share the same Item_A, drawing on both data sets. For example, the first four observations of Item_B in data set A should become "JAMES RIVER".

Can you please suggest a way to fill in the missing values in R? I tried many techniques but couldn't get what I wanted.

Answer 1:

As far as I understand the question, this is not simply a matter of filling in missing values within one column of each data.frame. I believe it requires filling in the values of Item_B that belong to each Item_A with the help of a lookup (mapping) table:

library(data.table)
# create mapping table from both data.frames
map <- unique(rbindlist(list(A, B)))[!is.na(Item_B)]
# or, in case there are additional columns besides Item_A and Item_B
map <- unique(rbindlist(list(A, B))[!is.na(Item_B), .(Item_A, Item_B)])
map

   Item_A         Item_B
1:   00EF    JAMES RIVER
2:   00FR RICE MIDSTREAM

# join and replace
setDT(A)[map, on = c("Item_A"), Item_B := i.Item_B][]

   Item_A         Item_B
1:   00EF    JAMES RIVER
2:   00EF    JAMES RIVER
3:   00EF    JAMES RIVER
4:   00EF    JAMES RIVER
5:   00EF    JAMES RIVER
6:   00FR RICE MIDSTREAM
7:   00FR RICE MIDSTREAM

setDT(B)[map, on = c("Item_A"), Item_B := i.Item_B][]

   Item_A         Item_B
1:   00EF    JAMES RIVER
2:   00EF    JAMES RIVER
3:   00EF    JAMES RIVER
4:   00FR RICE MIDSTREAM
5:   00FR RICE MIDSTREAM

During the join, there are two columns named Item_B: one from the first data table (A or B, respectively) and one from the second data table, map. To distinguish them, the i. prefix indicates that i.Item_B is taken from map.
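For comparison, the same lookup idea can be sketched in base R with match(), without data.table. This is only a minimal sketch under a few assumptions: the names lookup and fill_from_lookup are hypothetical (not from the answer above), each Item_A maps to a single Item_B, and only rows where Item_B is NA are touched.

```r
# Rebuild the example data from the question
A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA),
                stringsAsFactors = FALSE)
B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"),
                stringsAsFactors = FALSE)

# Lookup table: the distinct, complete (Item_A, Item_B) pairs from both data frames
lookup <- unique(na.omit(rbind(A, B)))

# Hypothetical helper: fill only the NA entries of Item_B from the lookup table.
# match() returns the position of the first matching Item_A, or NA if there is none.
fill_from_lookup <- function(df, lookup) {
  missing <- is.na(df$Item_B)
  df$Item_B[missing] <- lookup$Item_B[match(df$Item_A[missing], lookup$Item_A)]
  df
}

A_filled <- fill_from_lookup(A, lookup)
B_filled <- fill_from_lookup(B, lookup)
```

Unlike the update join, this leaves existing non-missing values of Item_B untouched, and an Item_A with no entry in the lookup table simply stays NA.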

Answer 2:

You could try creating a dictionary data frame.

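library(dplyr)
# build a dictionary of the known (Item_A, Item_B) pairs from both data frames
dictionary <- bind_rows(A, B) %>%
  filter(!is.na(Item_B)) %>%
  distinct()
# look up the Item_B value for a given Item_A id (first match; NA if unknown)
find_name <- function(id) {
  dictionary[["Item_B"]][match(id, dictionary[["Item_A"]])]
}
test_id <- c("00EF", "00EF", "00EF", "00FR", "00FR")
new_names <- sapply(test_id, find_name, USE.NAMES = FALSE)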

You could then rebuild your data frames:

New_A <- data.frame(Item_A = A$Item_A,
                    Item_B = sapply(as.character(A$Item_A), find_name, USE.NAMES = FALSE))

New_B <- data.frame(Item_A = B$Item_A,
                    Item_B = sapply(as.character(B$Item_A), find_name, USE.NAMES = FALSE))

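Answer 3:

You could try the fill helper from the tidyr library. Note that, used this way, fill works down and up the whole column without regard to Item_A, so rows 6 and 7 (00FR) receive "JAMES RIVER" rather than the expected "RICE MIDSTREAM":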

library(tidyr)
A %>% 
  tidyr::fill(Item_B, .direction = "down") %>% 
  tidyr::fill(Item_B, .direction = "up")

  Item_A      Item_B
1   00EF JAMES RIVER
2   00EF JAMES RIVER
3   00EF JAMES RIVER
4   00EF JAMES RIVER
5   00EF JAMES RIVER
6   00FR JAMES RIVER
7   00FR JAMES RIVER
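To get the expected result with fill, the rows of A and B probably need to be combined and filled within each Item_A group. The following is a rough sketch of that idea, not part of the original answer: the column name source and the names filled, A_filled and B_filled are made up for illustration, and it assumes a tidyr version that supports .direction = "downup".

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA),
                stringsAsFactors = FALSE)
B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"),
                stringsAsFactors = FALSE)

# Stack both data frames, remembering where each row came from ("A" or "B"),
# then fill Item_B down and up within each Item_A group
filled <- bind_rows(A = A, B = B, .id = "source") %>%
  group_by(Item_A) %>%
  fill(Item_B, .direction = "downup") %>%
  ungroup()

# Split the combined result back into the two original data sets
A_filled <- filled %>% filter(source == "A") %>% select(-source)
B_filled <- filled %>% filter(source == "B") %>% select(-source)
```

Grouping by Item_A keeps values from one item from leaking into another, and stacking A and B first is what lets the 00FR rows of A pick up "RICE MIDSTREAM" from B.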

Answer 4:

@YXCHEN, updated based on your input:

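library(data.table)
library(dplyr)

# lookup table of the known (Item_A, Item_B) pairs from both data frames
lookup_df <- unique(rbindlist(list(A, B)))[!is.na(Item_B)]

# keep only Item_A, then pull Item_B from the lookup table
left_join(A %>% select(Item_A), lookup_df, by = "Item_A")
left_join(B %>% select(Item_A), lookup_df, by = "Item_A")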

Source: https://stackoverflow.com/questions/45938081/fill-up-missing-values-using-the-other-data

Tags

missing-data