Replace missing values (NA) with most recent non-NA by group

前端 未结 7 909
南旧
南旧 2020-11-22 05:42

I would like to solve the following problem with dplyr. Preferable with one of the window-functions. I have a data frame with houses and buying prices. The following is an e

相关标签:
7条回答
  • 2020-11-22 05:58

    These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

    dplyr

    library(dplyr)
    library(zoo)
    
    na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
    df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
    

    giving:

    Source: local data frame [15 x 3]
    Groups: houseID
    
       houseID year price
    1        1 1995    NA
    2        1 1996   100
    3        1 1997   100
    4        1 1998   120
    5        1 1999   120
    6        2 1995    NA
    7        2 1996    NA
    8        2 1997    NA
    9        2 1998    30
    10       2 1999    30
    11       3 1995    NA
    12       3 1996    44
    13       3 1997    44
    14       3 1998    44
    15       3 1999    44
    

    A variation of this is:

    df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
    

    Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

    Another possibility is to combine the by solution (shown further below) with dplyr:

    df %>% by(df$houseID, na.locf2) %>% bind_rows
    

    by

    library(zoo)
    
    do.call(rbind, by(df, df$houseID, na.locf2))
    

    ave

    library(zoo)
    
    transform(df, price = ave(price, houseID, FUN = na.locf0))
    

    data.table

    library(data.table)
    library(zoo)
    
    data.table(df)[, na.locf2(.SD), by = houseID]
    

    zoo This solution uses zoo alone. It returns a wide rather than long result:

    library(zoo)
    
    z <- read.zoo(df, index = 2, split = 1, FUN = identity)
    na.locf2(z)
    

    giving:

           1  2  3
    1995  NA NA NA
    1996 100 NA 44
    1997 100 NA 44
    1998 120 30 44
    1999 120 30 44
    

    This solution could be combined with dplyr like this:

    library(dplyr)
    library(zoo)
    
    df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
    

    input

    Here is the input used for the examples above:

    df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
      2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
      1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
      1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
      30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
      "price"), class = "data.frame", row.names = c(NA, -15L))
    

    REVISED Re-arranged and added more solutions. Revised dplyr/zoo solution to conform to latest changes dplyr. Applied fixed and factored out na.locf2 from all solutions.

    0 讨论(0)
  • 2020-11-22 06:02

    Pure dplyr solution (no zoo).

    df %>% 
     group_by(houseID) %>%
     mutate(price_change = cumsum(0 + !is.na(price))) %>%
     group_by(price_change, add = TRUE) %>%
     mutate(price_filled = nth(price, 1)) %>%
     ungroup() %>%
     select(-price_change) -> df2
    

    Intresting part of example solution is at the end of df2.

    > tail(df2, 20)
    Source: local data frame [20 x 4]
    
        houseID year     price price_filled
     1       14 1995        NA           NA
     2       14 1996        NA           NA
     3       14 1997        NA           NA
     4       14 1998        NA           NA
     5       14 1999 0.8374778    0.8374778
     6       14 2000        NA    0.8374778
     7       14 2001        NA    0.8374778
     8       14 2002        NA    0.8374778
     9       14 2003 2.1918880    2.1918880
    10       14 2004        NA    2.1918880
    11       15 1995        NA           NA
    12       15 1996 0.3982450    0.3982450
    13       15 1997        NA    0.3982450
    14       15 1998 1.7727000    1.7727000
    15       15 1999        NA    1.7727000
    16       15 2000        NA    1.7727000
    17       15 2001        NA    1.7727000
    18       15 2002 7.8636329    7.8636329
    19       15 2003        NA    7.8636329
    20       15 2004        NA    7.8636329
    
    0 讨论(0)
  • 2020-11-22 06:04

    Since data.table v1.12.4, the package has a nafill() funciton, similar to tidyr::fill() or zoo::na.locf() and you can do:

    require(data.table)
    setDT(df)
    
    df[ , price := nafill(price, type = 'locf'), houseID ]
    

    There is also setnafill(), though not allowing for a group by, but multpile columns.

    setnafill(df, type = 'locf', cols = 'price')
    

    Data taken from @G. Grothendieck's answer:

    df = data.frame(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                                2L, 3L, 3L, 3L, 3L, 3L),
                    year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L,
                             1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L),
                    price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L,
                              NA, NA, NA))
    
    0 讨论(0)
  • 2020-11-22 06:06

    tidyr::fill now makes this stupidly easy:

    library(dplyr)
    library(tidyr)
    # or library(tidyverse)
    
    df %>% group_by(houseID) %>% fill(price)
    # Source: local data frame [15 x 3]
    # Groups: houseID [3]
    # 
    #    houseID  year price
    #      (int) (int) (int)
    # 1        1  1995    NA
    # 2        1  1996   100
    # 3        1  1997   100
    # 4        1  1998   120
    # 5        1  1999   120
    # 6        2  1995    NA
    # 7        2  1996    NA
    # 8        2  1997    NA
    # 9        2  1998    30
    # 10       2  1999    30
    # 11       3  1995    NA
    # 12       3  1996    44
    # 13       3  1997    44
    # 14       3  1998    44
    # 15       3  1999    44
    
    0 讨论(0)
  • 2020-11-22 06:11

    Without dplyr:

      prices$price <-unlist(lapply(split(prices$price,prices$houseID),
    function(x) zoo::na.locf(x,na.rm=FALSE)))
    
    prices
       houseID year price
    1        1 1995    NA
    2        1 1996   100
    3        1 1997   100
    4        1 1998   120
    5        1 1999   120
    6        2 1995    NA
    7        2 1996    NA
    8        2 1997    NA
    9        2 1998    30
    10       2 1999    30
    11       3 1995    NA
    12       3 1996    44
    13       3 1997    44
    14       3 1998    44
    15       3 1999    44
    
    0 讨论(0)
  • 2020-11-22 06:17

    You can do a rolling self-join, supported by data.table:

    require(data.table)
    setDT(df)   ## change it to data.table in place
    setkey(df, houseID, year)     ## needed for fast join
    df.woNA <- df[!is.na(price)]  ## version without the NA rows
    
    # rolling self-join will return what you want
    df.woNA[df, roll=TRUE]  ## will match previous year if year not found
    
    0 讨论(0)
提交回复
热议问题