Randomly insert NAs into dataframe proportionally

無奈伤痛 2020-11-29 12:07

I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.

A <- c(1:10)
B <- c(1         


        
6 Answers
  • 2020-11-29 12:39

    If you are in the mood to use purrr instead of lapply, you can also do it like this:

    > library(purrr)
    > df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
    > df
        A  B  C
    1   1 11 21
    2   2 12 22
    3   3 13 23
    4   4 14 24
    5   5 15 25
    6   6 16 26
    7   7 17 27
    8   8 18 28
    9   9 19 29
    10 10 20 30
    > map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
    # A tibble: 10 x 3
           A     B     C
       <int> <int> <int>
    1      1    11    21
    2      2    12    22
    3     NA    13    NA
    4      4    14    NA
    5      5    15    25
    6      6    16    26
    7      7    17    27
    8      8    NA    28
    9      9    19    29
    10    10    20    30
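
    The reason this works is that indexing a vector with a logical NA returns NA at that position, while TRUE returns the original value. A minimal sketch of that mechanism (the vector x and the index keep are made-up illustration values, not part of the answer above):

    ```r
    x <- 11:15
    # hypothetical index: TRUE keeps the value, NA produces an NA in the result
    keep <- c(TRUE, NA, TRUE, TRUE, NA)
    x[keep]  # 11 NA 13 14 NA
    ```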
    
  • 2020-11-29 12:42

    A mutate_all approach:

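    library(dplyr)  # for %>%
    # keep each value with probability 0.8, otherwise set it to NA;
    # as.character() makes every resulting column a character vector
    df %>%
      dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
             as.character(.), NA))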
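
    The same idea can probably be written with across(), which dplyr recommends over mutate_all() since version 1.0.0 (so this sketch assumes dplyr 1.0.0 or later). Using replace() rather than ifelse() with as.character() keeps each column's original type. The seed and the name df_na are arbitrary choices for illustration.

    ```r
    library(dplyr)

    df <- data.frame(A = 1:10, B = 11:20, C = 21:30)

    set.seed(42)  # arbitrary seed, only for reproducibility
    df_na <- df %>%
      mutate(across(
        everything(),
        # TRUE (probability 0.2) marks the positions that become NA
        ~ replace(.x,
                  sample(c(TRUE, FALSE), size = length(.x), replace = TRUE, prob = c(0.2, 0.8)),
                  NA)
      ))

    sapply(df_na, class)  # columns stay integer
    ```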
    
  • 2020-11-29 12:45

    May I suggest a first function (ggNAadd) designed to do this, complemented by a second function (ggNA) that plots the distribution of the NAs created.

    What is neat is that you can input either a proportion or a fixed number of NAs.

    ggNAadd = function(data, amount, plot=FALSE){
      temp <- data
      # amount < 1 is read as a proportion of all cells, otherwise as a count
      amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
      if (amount2 >= prod(dim(data))) stop("exceeded data size")
      # each pass picks one random cell; the same cell can be picked again
      for (i in seq_len(amount2)) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
      if (plot) print(ggNA(temp))
      return(temp)
    }
    

    And the plotting function:

    ggNA = function(data, alpha=0.5){
      require(ggplot2)
      DF <- data
      if (!is.matrix(data)) DF <- as.matrix(DF)
      to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)), 
                                  'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
      size <- 20 / log( prod(dim(DF)) )  # point size depends on the size of the table
      g <- ggplot(data=to.plot) + aes(x,y) +
        geom_point(size=size, color="red", alpha=alpha) +
        scale_y_reverse() + xlim(1,ncol(DF)) +
        ggtitle("location of NAs in the data frame") +
        xlab("columns") + ylab("lines")
      pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
      print(paste("percentage of NA data: ", pc))
      return(g)
    }
    

    Which gives (using ggplot2 as graphical output):

    ggNAadd(df, amount=0.20, plot=TRUE)
    ## [1] "percentage of NA data:  20"
    ##     A  B  c
    ## 1   1 11 21
    ## 2   2 12 22
    ## 3   3 13 23
    ## 4   4 NA 24
    ## ..
    

    [Image: ggplot2 scatter plot titled "location of NAs in the data frame", with columns on the x-axis and lines on the y-axis]

    Of course, if you ask for too many NAs, the actual percentage will fall below the target, because the same cell can be picked more than once.
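
    One way around the repetitions is to draw distinct cell positions in a single sample.int() call, which samples without replacement by default. The following is only a sketch: add_na_exact is a hypothetical helper name, not part of the functions above, and it reuses the same proportion-or-count convention for amount.

    ```r
    # Hypothetical helper: sets exactly the requested number of distinct cells to NA
    add_na_exact <- function(data, amount) {
      n_cells <- prod(dim(data))
      # amount < 1 is read as a proportion of all cells, otherwise as a count
      n_na <- if (amount < 1) round(n_cells * amount) else amount
      if (n_na >= n_cells) stop("exceeded data size")
      # distinct cell positions, so no cell is picked twice
      idx <- sample.int(n_cells, n_na)
      mask <- matrix(FALSE, nrow = nrow(data), ncol = ncol(data))
      mask[idx] <- TRUE
      data[mask] <- NA
      data
    }

    df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
    set.seed(1)  # arbitrary seed, only for reproducibility
    out <- add_na_exact(df, 0.20)
    sum(is.na(out))  # 6, i.e. round(30 * 0.20)
    ```

    Which cells are hit still varies from run to run, but the number of NAs does not.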

  • 2020-11-29 12:48

    Same result, using binomial distribution:

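    dd <- dim(df)
    nna <- 20/100  # overall probability that a cell becomes NA
    df1 <- df
    # one Bernoulli draw per cell; cells that draw 1 are set to NA
    df1[matrix(rbinom(prod(dd), size = 1, prob = nna) == 1, nrow = dd[1])] <- NA
    df1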
    
  • 2020-11-29 12:56
    df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
    head(df)
    ##   A  B  c
    ## 1 1 11 21
    ## 2 2 12 22
    ## 3 3 13 23
    ## 4 4 14 24
    ## 5 5 15 25
    ## 6 6 16 26
    
    as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
    ##     A  B  c
    ## 1   1 11 21
    ## 2   2 12 22
    ## 3   3 13 23
    ## 4   4 14 24
    ## 5   5 NA 25
    ## 6   6 16 26
    ## 7  NA 17 27
    ## 8   8 18 28
    ## 9   9 19 29
    ## 10 10 20 30
    

    It's a random process, so it might not give 15% every time.
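
    To see how much the realized share moves around, the approach can be repeated many times and summarized. This is only an illustrative sketch: na_share is a hypothetical helper name, and the seed and the number of repetitions are arbitrary.

    ```r
    df <- data.frame(A = 1:10, B = 11:20, c = 21:30)

    # Hypothetical helper: share of NAs produced by one run of the lapply approach
    na_share <- function(d, p = 0.15) {
      masked <- lapply(d, function(cc) {
        cc[sample(c(TRUE, NA), prob = c(1 - p, p), size = length(cc), replace = TRUE)]
      })
      mean(is.na(unlist(masked)))
    }

    set.seed(123)  # arbitrary seed, only for reproducibility
    shares <- replicate(2000, na_share(df))
    range(shares)  # individual runs differ from one another
    mean(shares)   # close to 0.15 on average
    ```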

  • 2020-11-29 13:03

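    You can unlist the data.frame, set a random sample of positions to NA, and then put the values back into a data.frame.

    vals <- unlist(df)
    n <- round(length(vals) * 0.15)        # number of values to replace
    vals[sample(length(vals), n)] <- NA    # sample positions, not values
    as.data.frame(matrix(vals, ncol = ncol(df), dimnames = list(NULL, names(df))))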
    

    It can be done a bunch of different ways using sample().
