Randomly insert NAs into dataframe proportionaly

后端未结

关注

 6  1431

I have a complete dataframe. I want to 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.

A <- c(1:10)
B <- c(1


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2020-11-29 12:39
              
            
            
                                                                       
If you are in the mood to use purrr instead of lapply, you can also do it like this:

> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
    A  B  C
1   1 11 21
2   2 12 22
3   3 13 23
4   4 14 24
5   5 15 25
6   6 16 26
7   7 17 27
8   8 18 28
9   9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
       A     B     C
   <int> <int> <int>
1      1    11    21
2      2    12    22
3     NA    13    NA
4      4    14    NA
5      5    15    25
6      6    16    26
7      7    17    27
8      8    NA    28
9      9    19    29
10    10    20    30

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  无人共我        
                
              
                            
                2020-11-29 12:42
              
            
            
                                                                       
A mutate_all approach:
df %>% 
  dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
         as.character(.), NA))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  爱一瞬间的悲伤        
                
              
                            
                2020-11-29 12:45
              
            
            
                                                                       
May i suggest a first function (ggNAadd) designed to do this, and improve it with a second function providing graphical distribution of the NAs created (ggNA)

What is neat is the possibility to input either a proportion of a fixed number of NAs. 

ggNAadd = function(data, amount, plot=F){
  temp <- data
  amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
  if (amount2 >= prod(dim(data))) stop("exceeded data size")
  for (i in 1:amount2) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
  if (plot) print(ggNA(temp))
  return(temp)
}


And the plotting function:

ggNA = function(data, alpha=0.5){
  require(ggplot2)
  DF <- data
  if (!is.matrix(data)) DF <- as.matrix(DF)
  to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)), 
                              'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
  size <- 20 / log( prod(dim(DF)) )  # size of point depend on size of table
  g <- ggplot(data=to.plot) + aes(x,y) +
    geom_point(size=size, color="red", alpha=alpha) +
    scale_y_reverse() + xlim(1,ncol(DF)) +
    ggtitle("location of NAs in the data frame") +
    xlab("columns") + ylab("lines")
  pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
  print(paste("percentage of NA data: ", pc))
  return(g)
}


Which gives (using ggplot2 as graphical output):

ggNAadd(df, amount=0.20, plot=TRUE)
## [1] "percentage of NA data:  20"
##     A  B  c
## 1   1 11 21
## 2   2 12 22
## 3   3 13 23
## 4   4 NA 24
## ..




Of course, as mentioned earlier, if you ask too many NAs the actual percentage will drop because of repetitions.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别那么骄傲        
                
              
                            
                2020-11-29 12:48
              
            
            
                                                                       
Same result, using binomial distribution:

dd=dim(df)
nna=20/100 #overall
df1<-df
df1[matrix(rbinom(prod(dd), size=1,prob=nna)==1,nrow=dd[1])]<-NA
df1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  离开以前        
                
              
                            
                2020-11-29 12:56
              
            
            
                                                                       
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
##   A  B  c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26

as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
##     A  B  c
## 1   1 11 21
## 2   2 12 22
## 3   3 13 23
## 4   4 14 24
## 5   5 NA 25
## 6   6 16 26
## 7  NA 17 27
## 8   8 18 28
## 9   9 19 29
## 10 10 20 30


It's a random process, so it might not give 15% every time.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  渐次进展        
                
              
                            
                2020-11-29 13:03
              
            
            
                                                                       
You can unlist the data.frame and then take a random sample, then put back in a data.frame. 

df <- unlist(df)
n <- length(df) * 0.15
df[sample(df, n)] <- NA
as.data.frame(matrix(df, ncol=3))


It can be done a bunch of different ways using sample().  
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复