Fill missing value based on probability of occurrence

Question

This is what my data.table/data frame looks like:

library(data.table)
dt <- fread('
   STATE     ZIP      
   PA        19333        
   PA        19327        
   PA        19333        
   PA        NA        
   PA        19355
   PA        19333
   PA        NA
   PA        19355
   PA        NA     
')

I have three missing values in the ZIP column. I want to fill them with values sampled from the non-missing ZIPs, according to each ZIP's probability of occurring in the dataset. For example, among the six non-missing rows, ZIP 19333 occurs three times, 19355 occurs twice, and 19327 occurs once. So for PA, 19333 has a 50% probability of occurring, 19355 has a 33.33% chance, and 19327 has a 16.67% chance. ZIP 19333 therefore has the highest probability of being picked when filling the three missing ZIPs. The final filled dataset may look like the following, where two missing values were filled with '19333' and one with '19355':

       STATE     ZIP      
       PA        19333        
       PA        19327        
       PA        19333        
       PA        19333       
       PA        19355
       PA        19333
       PA        19333
       PA        19355
       PA        19355

I have more than one STATE in my dataset. The main idea is to fill in missing ZIPs based on the probability of a ZIP occurring for a given STATE.
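The probabilities quoted above can be checked with a short base R sketch. The vector name `zip` and the use of `prop.table()` are illustrative choices here, not part of the original question.

```r
# ZIP column for PA from the example, with NA marking the missing entries
zip <- c(19333, 19327, 19333, NA, 19355, 19333, NA, 19355, NA)

# Keep only the observed values; their relative frequencies are the
# sampling probabilities described in the question
observed <- zip[!is.na(zip)]
probs <- prop.table(table(observed))

print(round(probs, 4))
```

Each probability is the ZIP's count divided by the number of non-missing rows (six here), so the three values sum to 1.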

Answer 1:

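Here's a way just using sample, wrapped up in a convenience function. Sampling with replacement from the non-missing values picks each ZIP with probability proportional to how often it occurs, so no explicit probabilities are needed.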

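sample_fill_na = function(x) {
    x_na = is.na(x)
    pool = x[!x_na]  # observed values; duplicates act as frequency weights
    # Nothing to fill, or nothing to sample from: return x unchanged
    if (!any(x_na) || length(pool) == 0L) return(x)
    # Sample positions rather than values: sample(n, ...) with a single number n draws from 1:n
    x[x_na] = pool[sample.int(length(pool), size = sum(x_na), replace = TRUE)]
    return(x)
}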

dt[, ZIP := sample_fill_na(ZIP), by = STATE]
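As a rough illustration of the per-STATE fill on more than one state, here is a self-contained sketch. The table `dt2`, its NJ rows, and the helper name `fill_na_by_sampling` are made up for this example. The helper restates the sampling idea with a guard for groups that have only one observed value, since R's `sample()` treats a single number n as the range 1:n.

```r
library(data.table)

# Standalone helper (hypothetical name) so this block runs on its own
fill_na_by_sampling <- function(x) {
    x_na <- is.na(x)
    pool <- x[!x_na]
    # Leave the group alone if nothing is missing or nothing is observed
    if (!any(x_na) || length(pool) == 0L) return(x)
    # Draw positions in pool, so a one-element pool is handled correctly
    x[x_na] <- pool[sample.int(length(pool), size = sum(x_na), replace = TRUE)]
    x
}

# Hypothetical two-state data; NJ has a single observed ZIP
dt2 <- data.table(
    STATE = c("PA", "PA", "PA", "PA", "NJ", "NJ", "NJ"),
    ZIP   = c(19333L, 19333L, 19355L, NA, 8540L, NA, NA)
)

set.seed(42)  # only to make the random fill reproducible
dt2[, ZIP := fill_na_by_sampling(ZIP), by = STATE]
print(dt2)
```

Because the fill is grouped by STATE, the missing NJ rows can only receive 8540, and the missing PA row can only receive a ZIP already seen in PA. Which PA value is drawn depends on the seed.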

Source: https://stackoverflow.com/questions/47143523/fill-missing-value-based-on-probability-of-occurrence

Tags

data.table

missing-data
