This is what my data.table/dataframe looks like:
library(data.table)
dt <- fread('
STATE ZIP
PA 19333
PA 19327
PA 19333
PA NA
PA 19355
PA 19333
PA NA
PA 19355
PA NA
')
I have three missing values in the ZIP column. I want to fill the missing values with nonmissing sample values of ZIPs according to their probability of occurring in the dataset. For example, among the six nonmissing values, ZIP 19333 occurs three times, ZIP 19355 occurs twice, and ZIP 19327 occurs once. So for PA, ZIP 19333 has a 50% probability of occurring, 19355 has a 33.33% chance, and 19327 has a 16.67% chance. ZIP 19333 therefore has the highest probability of being picked when filling the three missing ZIPs. The final filled dataset may look like the following, where two missing values are filled by '19333' and one by '19355':
STATE ZIP
PA 19333
PA 19327
PA 19333
PA 19333
PA 19355
PA 19333
PA 19333
PA 19355
PA 19355
I have more than one STATE in my dataset. The main idea is to fill in missing ZIPs based on the probability of a ZIP occurring for a given STATE.
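As a quick check of the arithmetic above, here is a minimal base R sketch. The vector name `zip` is only for illustration; it repeats the PA values from the example, with NA marking the missing ZIPs.

```r
# ZIP values for PA from the example (NA marks a missing ZIP)
zip <- c(19333, 19327, 19333, NA, 19355, 19333, NA, 19355, NA)

# table() drops NA by default, so only the six observed ZIPs are counted
counts <- table(zip)

# Turn the counts into relative frequencies (they sum to 1)
probs <- prop.table(counts)

# Show them as percentages
print(round(100 * probs, 2))
# 19327: 16.67, 19333: 50, 19355: 33.33
```

These relative frequencies are the probabilities the fill should respect for PA.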
Here's a way just using sample, wrapped up in a convenience function.
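sample_fill_na = function(x) {
  x_na = is.na(x)
  # Nothing to fill, or no observed values to draw from: return unchanged
  if (!any(x_na) || all(x_na)) return(x)
  observed = x[!x_na]
  # Draw positions, not values: sample() on a single number n samples from 1:n
  idx = sample.int(length(observed), size = sum(x_na), replace = TRUE)
  x[x_na] = observed[idx]
  return(x)
}
# Fill within each STATE so draws follow that state's own ZIP frequencies
dt[, ZIP := sample_fill_na(ZIP), by = STATE]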
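Why this matches the requested probabilities: drawing uniformly, with replacement, from the observed values picks each distinct ZIP in proportion to how often it appears. The sketch below checks that by simulation in base R. The seed and the number of draws are arbitrary choices for illustration, and `observed` just repeats the six nonmissing PA values from the question.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# The six nonmissing PA ZIPs from the question
observed <- c(19333, 19327, 19333, 19355, 19333, 19355)

# Many draws with replacement, by position, from the observed values
n_draws <- 60000
draws <- observed[sample.int(length(observed), size = n_draws, replace = TRUE)]

# Compare simulated shares with the observed relative frequencies
expected  <- prop.table(table(observed))
simulated <- prop.table(table(draws))
print(round(expected, 3))
print(round(simulated, 3))
```

With this many draws the simulated shares should sit close to 1/6, 1/2 and 1/3. In the real fill only a handful of values are drawn per state, so any single result can deviate noticeably from those shares.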
Source: https://stackoverflow.com/questions/47143523/fill-missing-value-based-on-probability-of-occurrence