Question
Edit: The question was originally asked for data.table. A solution with any package would be interesting.
I am a little stuck with a particular variation of a more general problem. I have panel data that I am using with data.table and I would like to fill in some missing values using the group by functionality of data.table. Unfortunately they are not numeric, so I can't simply interpolate, but they should only be filled in based on a condition. Is it possible to perform a kind of conditional na.locf in data.tables?
Essentially, I only want to fill in the NAs if the next non-NA observation after them equals the last observation before them, though the more general question is how to fill in NAs conditionally.
For example, in the following data I would like to fill in the associatedid variable within each id group. So id==1, year==2003 would fill in as ABC123 because it's the value before and after the NA, but not 2000 for the same id. id==2 would not be changed because the next value is not the same as the one prior to the NAs. id==3 would fill in for 2003 and 2004.
mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 <NA>
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 <NA>
#> 17 3 2004 <NA>
#> 18 3 2005 ABC123
dt = data.table(mydf, key = c("id"))
Desired output:
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Answer 1:
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
new.locf <- function(x){
  # vectors shorter than 3 have no interior values to fill; this also
  # guards 2:(length(x)-1), which would otherwise count downwards
  if(length(x) < 3) return(x)
  # you might still want to think about the end of the vector:
  # as written, trailing NAs are simply left as NA.
  #
  # loop through the interior observations in a vector, x.
  for(i in 2:(length(x)-1)){
    nextval = i
    # find the next non-NA value; the search stops at the last
    # element, which may itself be NA
    while(nextval <= length(x)-1 && is.na(x[nextval])){
      nextval = nextval + 1
    }
    # if the current value is NA, the previous value is not NA
    # (earlier NAs in the same gap have already been filled), and
    # the next value, as found above, is the same as the previous
    # value, then give us that value.
    # isTRUE() handles the case where x[nextval] is NA.
    if(is.na(x[i]) && !is.na(x[i-1]) && isTRUE(x[i-1] == x[nextval])){
      x[i] <- x[nextval]
    }
    # otherwise x[i] is left unchanged: non-NA values stay as they
    # are and NAs that cannot be filled stay NA
  }
  # return the new vector
  return(x)
}
Once we have that function, we can use data.table as usual:
dt2 <- dt[, list(year = year,
                 # when I read your data in, associatedid read as factor
                 associatedid = new.locf(as.character(associatedid))
                 ),
          by = "id"
          ]
This returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
which is what you are looking for as best I understand it.
I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.
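As a loop-free variation on the same rule, here is a minimal base R sketch built on rle(). The helper name fill_between_equal is hypothetical and is not part of the answer above; it assumes that a run of NAs should be filled only when it has a non-NA neighbour on both sides and those two neighbours are equal.

```r
# Hypothetical helper (an illustration, not the answer's new.locf):
# fill a run of NAs only when the values on both sides of the run are equal.
fill_between_equal <- function(x) {
  x <- as.character(x)              # factors are compared as text
  r <- rle(is.na(x))                # alternating runs of NA / non-NA
  ends   <- cumsum(r$lengths)       # last index of each run
  starts <- ends - r$lengths + 1    # first index of each run
  for (k in which(r$values)) {      # visit only the NA runs
    # NA runs touching either end of the vector have no pair of neighbours
    if (starts[k] == 1 || ends[k] == length(x)) next
    before <- x[starts[k] - 1]      # non-NA by construction of the runs
    after  <- x[ends[k] + 1]
    if (before == after) x[starts[k]:ends[k]] <- before
  }
  x
}

fill_between_equal(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
fill_between_equal(c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456"))
```

Per group, this could be applied in base R with something like ave(as.character(mydf$associatedid), mydf$id, FUN = fill_between_equal).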
Answer 2:
If na.locf0 applied forwards and backwards gives the same value, then use that value; otherwise, if the two are not equal or if either is NA, then use NA.
library(data.table)
library(zoo)
dt[, associatedid :=
     ifelse(na.locf0(associatedid) == na.locf0(associatedid, fromLast = TRUE),
            na.locf0(associatedid), NA), by = id]
giving:
> dt
id year associatedid
1: 1 2000 <NA>
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 <NA>
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 <NA>
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 <NA>
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
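To make the forward/backward comparison concrete, here is a small base R sketch on a single group's values. The helpers locf_fwd and locf_bwd are hypothetical stand-ins for na.locf0(x) and na.locf0(x, fromLast = TRUE), written only so the example runs without zoo; they are not zoo's implementation.

```r
# Hypothetical base-R stand-ins for the two na.locf0 calls (illustration only)
locf_fwd <- function(x) {
  # index of the most recent non-NA value at each position (0 if none yet)
  idx <- cummax(seq_along(x) * (!is.na(x)))
  c(NA, x)[idx + 1]                 # the prepended NA covers idx == 0
}
locf_bwd <- function(x) rev(locf_fwd(rev(x)))

x   <- c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456")
fwd <- locf_fwd(x)   # last value carried forward
bwd <- locf_bwd(x)   # next value carried backward

# keep a value only where both directions agree; a mismatch or an NA
# on either side gives NA
ifelse(fwd == bwd, fwd, NA)
```

The gap at position 4 stays NA because the forward fill proposes ABC123 while the backward fill proposes DEF456, and the leading NA stays NA because nothing can be carried forward into it.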
Answer 3:
Here's a pure tidyverse solution:
library(tidyverse)
mydf %>%
  mutate(up = associatedid, down = associatedid) %>%
  group_by(id) %>%
  fill(up, .direction = "up") %>%
  fill(down) %>%
  mutate_at("associatedid", ~if_else(is.na(.) & up == down, up, .)) %>%
  ungroup() %>%
  select(-up, -down)
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Or using zoo::na.locf:
library(dplyr)
library(zoo)
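mydf %>%
  group_by(id) %>%
  # na.rm = FALSE keeps leading NAs so the vectors stay the same length
  mutate_at("associatedid", ~if_else(
    is.na(.) & na.locf(., na.rm = FALSE) == na.locf(., na.rm = FALSE, fromLast = TRUE),
    na.locf(., na.rm = FALSE), .)) %>%
  ungroup()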
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
The same idea, but using data.table:
library(zoo)
library(data.table)
setDT(mydf)
mydf[, associatedid := fifelse(
  is.na(associatedid) &
    na.locf(associatedid, na.rm = FALSE) == na.locf(associatedid, na.rm = FALSE, fromLast = TRUE),
  na.locf(associatedid, na.rm = FALSE), associatedid),
  by = id]
mydf
#> id year associatedid
#> 1: 1 2000 <NA>
#> 2: 1 2001 ABC123
#> 3: 1 2002 ABC123
#> 4: 1 2003 ABC123
#> 5: 1 2004 ABC123
#> 6: 1 2005 ABC123
#> 7: 2 2000 <NA>
#> 8: 2 2001 ABC123
#> 9: 2 2002 ABC123
#> 10: 2 2003 <NA>
#> 11: 2 2004 DEF456
#> 12: 2 2005 DEF456
#> 13: 3 2000 <NA>
#> 14: 3 2001 ABC123
#> 15: 3 2002 ABC123
#> 16: 3 2003 ABC123
#> 17: 3 2004 ABC123
#> 18: 3 2005 ABC123
And finally a fun idea using base R: if this character variable were numeric, you would want to interpolate only where constant interpolation and linear interpolation give the same result:
i <- ave(as.numeric(factor(mydf$associatedid)), mydf$id, FUN = function(x) ifelse(
  approx(x, xout = seq_along(x))$y == (z <- approx(x, xout = seq_along(x), method = "constant")$y),
  z, x))
mydf$associatedid <- levels(mydf$associatedid)[i]
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
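The reasoning behind the approx() trick can be checked on the integer codes of a single group. The sketch below uses a hypothetical helper, fill_if_flat, and made-up code vectors; note that approx() needs at least two non-NA points, so a group with fewer would need separate handling.

```r
# Hypothetical helper illustrating the idea: a gap is filled only where
# linear and constant (step) interpolation agree, i.e. where the values on
# both sides of the gap are equal.
fill_if_flat <- function(codes) {
  pos   <- seq_along(codes)
  lin   <- approx(pos, codes, xout = pos)$y                       # linear
  const <- approx(pos, codes, xout = pos, method = "constant")$y  # carry forward
  # where they differ, or where either is NA, fall back to the original value
  ifelse(lin == const, const, codes)
}

fill_if_flat(c(NA, 1, 1, NA, 2, 2))   # gap between 1 and 2: linear gives 1.5, so no fill
fill_if_flat(c(NA, 1, 1, NA, NA, 1))  # gap between 1 and 1: both give 1, so it is filled
```

In both calls the leading NA is left alone, because with the default rule = 1 approx() returns NA outside the range of the observed points.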
Answer 4:
You could roll forwards and backwards for the missing rows, compare values and assign if they are equal:
library(data.table)
DT = data.table(mydf)
w = DT[is.na(associatedid), which=TRUE]
dn = DT[w, DT[-w][.SD, on=.(id, year), roll=TRUE, x.associatedid]]
up = DT[w, DT[-w][.SD, on=.(id, year), roll=-Inf, x.associatedid]]
# positions within w where the forward and backward rolls agree (NAs dropped)
keep = which(up == dn)
# dn and up are aligned with w, so index them by position in w, not by row number in DT
DT[w[keep], associatedid := dn[keep]]
id year associatedid
1: 1 2000 <NA>
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 <NA>
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 <NA>
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 <NA>
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
Answer 5:
Here is another attempt with dplyr:
library(dplyr)
mydf %>%
  # Detect NA values in associatedid
  mutate(isReplaced = is.na(associatedid), ans = associatedid) %>%
  group_by(id) %>%
  # Fill all NA values
  tidyr::fill(associatedid) %>%
  # Detect the NA values which were replaced
  mutate(isReplaced = isReplaced & !is.na(associatedid)) %>%
  # Group by id and associatedid (.add needs dplyr >= 1.0.0; older versions use add)
  group_by(associatedid, .add = TRUE) %>%
  # Set back to NA if the value was replaced and is the first or last row of the group
  mutate(ans = replace(associatedid, row_number() %in% c(1, n()) & isReplaced, NA)) %>%
  ungroup() %>%
  select(-isReplaced, -associatedid)
# A tibble: 18 x 3
# id year ans
# <int> <int> <fct>
# 1 1 2000 NA
# 2 1 2001 ABC123
# 3 1 2002 ABC123
# 4 1 2003 ABC123
# 5 1 2004 ABC123
# 6 1 2005 ABC123
# 7 2 2000 NA
# 8 2 2001 ABC123
# 9 2 2002 ABC123
#10 2 2003 NA
#11 2 2004 DEF456
#12 2 2005 DEF456
#13 3 2000 NA
#14 3 2001 ABC123
#15 3 2002 ABC123
#16 3 2003 ABC123
#17 3 2004 ABC123
#18 3 2005 ABC123
Answer 6:
I've been trying to put together a two-pass approach. The first pass would replace each NA with the preceding value (within an id) with "p_" pasted in front of it, and the second pass would check that the last of a sequence agrees with the next real value. I offer my code so far, which is not really an answer, so I'm not expecting any upvotes. (It probably would have been easier to rename associatedid as asid.)
lapply(split(mydf, mydf$id), function(d) {
  d$associatedid <- as.character(d$associatedid)
  # positions of the missing values within this id
  missloc <- which(is.na(d$associatedid))
  for (n in missloc) {
    # no previous or next value to compare against at the edges
    if (n == 1 || n == nrow(d)) next
    prev <- d$associatedid[n - 1]
    nxt  <- d$associatedid[n + 1]
    if (isTRUE(nxt %in% c(prev, paste0("p_", prev)) &&
               grepl(gsub("p\\_", "", prev), nxt))) {
      d$associatedid[n] <- prev
    } else {
      # tentative NA replacement
      d$associatedid[n] <- paste0("p_", prev)
    }
  }
  d  # first pass only; the second pass is still to be written
})
Source: https://stackoverflow.com/questions/27368770/conditional-na-filling-by-group