Question
There are edits to this post at the end.
I have a large dataset of daily dietary records for a population of individuals. There are data missing at random from each of the individuals. This is an example for one individual (I will eventually generalize this solution to the population):
> str(final_daily)
'data.frame': 387 obs. of 10 variables:
$ Date : chr "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
$ MEID.1 : Factor w/ 97 levels "","1","1.1","1.1a",..: NA NA NA 17 24 NA NA NA NA NA ...
$ MEID.2 : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
$ MEID.3 : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
$ MEID.4 : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
$ MEID.5 : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
$ MEID.6 : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
$ DAYT : int NA NA NA 1 8 NA NA NA NA NA ...
$ DATT : int NA NA NA 1 1 NA NA NA NA NA ...
$ Reason.For.Change: chr "0" "0" "0" "0" ...
I am aware of the implementations that can be used to fill in missing data such as last observation carried forward (LOCF) and next observation carried backwards (NOCB). Importantly, the missing data gaps can exist for as few as a single date to up to months of days at a time.
I would like to create an imputation method that uses LOCF for the first half of the missing time period and NOCB for the second half of the missing time period. This is more important for large time series gaps (I don't want to use dietary intake on February 28 to be representative for August 1 when August 2 is available). Can anyone suggest a possible solution here?
Importantly, I also have a column (Reason.For.Change) which should constrain the imputation methods as in Filling in missing (blanks) in a data table, per category - backwards and forwards. For example, when Reason.For.Change has a value >0, the imputation should recognize this. In other words, Reason.For.Change values >0 denote "different" time series within an individual that starts on the day where Reason.For.Change is >0, and these time series must be imputed separately.
Essentially, this column creates two conditions. First, when a record is not available on the date just before a date where Reason.For.Change is >0, only LOCF can be used. Second, when a record of diet intake is not available on the same date that Reason.For.Change is >0, only NOCB can be used. (This second case is analogous to the example in Filling in missing (blanks) in a data table, per category - backwards and forwards, where patients are missing 'doctor' on their first visit.)
Any advice/direction is appreciated to accomplish the following, which I summarize below:
1. Imputation method for time series gaps that uses LOCF for the first 50% of the gap and NOCB for the last 50%
2. Imputation method as in 1) that acknowledges breaks in the time series, denoted by values >0 on a date, and allows LOCF up to the 'break-date' and NOCB filling back to and including the break-date
[Edit] After thinking some more, I found that the implementations in "R -- Carry last observation forward n times" and "Fill NA in a time series only to a limited number" seem to offer a step toward addressing 1) in my question. However, I would like to generalize their use of LOCF n times to LOCF for length(missing data)/2 ...
[Edit 2] After thinking even more, I have added a new column to my dataframe, GAP_Days, which counts the number of days in the missing time period (gap). Here is str() on the data after the new column was added.
> str(final_daily_intake2)
'data.frame': 387 obs. of 11 variables:
$ Date : chr "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
$ MEID.1 : chr NA NA NA "14" ...
$ MEID.2 : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
$ MEID.3 : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
$ MEID.4 : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
$ MEID.5 : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
$ MEID.6 : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
$ DAYT : int NA NA NA 1 8 NA NA NA NA NA ...
$ DATT : int NA NA NA 1 1 NA NA NA NA NA ...
$ Reason.For.Change: chr "0" "0" "0" "0" ...
$ GAP_Days : chr "1" "2" "3" "NA" ...
I was thinking that this could be used to determine the number of days, n, to use LOCF on for each gap period. For example, in the first missing data time period, there are 3 days missing (hence 1, 2, 3 in the str() for GAP_Days). In this example, since it is an odd number of days, I would like LOCF to use the result of round(3 * 0.5) to obtain a value of 2, which would be used as input to LOCF. In a longer time period, for example where the length of GAP_Days is 30, LOCF would use the result of round(30 * 0.5), such that LOCF would be used for 15 days.
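One caveat with the round() idea: R's round() sends exact halves to the even neighbour, so odd gap lengths are not all treated the same way (a 3-day gap gives 2 LOCF days, but a 1-day gap gives 0 and a 5-day gap gives 2). A minimal sketch comparing it with ceiling(), which always gives the extra day of an odd gap to LOCF; the variable names are illustrative only:

```r
# Hypothetical gap lengths (in days) to split between LOCF and NOCB
gap_len <- c(1, 2, 3, 5, 30)

# round() uses "round half to even" for exact halves such as 0.5, 1.5, 2.5
n_locf_round <- round(gap_len * 0.5)
print(n_locf_round)    # 0 1 2 2 15

# ceiling() consistently assigns the middle day of an odd gap to LOCF
n_locf_ceiling <- ceiling(gap_len / 2)
print(n_locf_ceiling)  # 1 1 2 3 15
```

Whichever rule is chosen, the remaining gap_len - n days of each gap would go to NOCB.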
I think this approach can be used to go over the dataframe once with LOCF, and then a second time with NOCB. (Although I still haven't addressed the need to acknowledge breaks in the time series denoted by Reason.For.Change). Much thanks.
Answer 1:
Since the text is very long, I'll restate the two questions:
1) Imputation method for time series gaps that uses LOCF for the first 50% of the gap and NOCB for the last 50%
2) Imputation method as in 1) that acknowledges breaks in the time series, denoted by values >0 on a date, and allows LOCF up to the 'break-date' and NOCB filling back to and including the break-date
As far as I know, there is no R package that directly does either of these tasks.
To 1): Quite a few packages provide an LOCF function:
- imputeTS::na.locf()
- zoo::na.locf()
- xts::na.locf()
- spacetime::na.locf()
Your idea for the imputation does make sense, but none of these packages has an option for the behavior you describe. What you can do, e.g. with zoo, is set the maxgap parameter: runs of more than maxgap NAs are then left untouched, which means you can (and must) treat them separately afterwards. You would have to program the requested behavior yourself.
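As a rough illustration of what "programming it yourself" could look like, here is a minimal base R sketch. It is not taken from any of the packages above; the function name fill_half_locf_nocb is made up for this example. It fills each NA from the nearest observed value, with ties going to the earlier observation, which amounts to LOCF for the first ceiling(n/2) days of an n-day gap and NOCB for the rest. Leading NAs can only be filled backwards and trailing NAs only forwards.

```r
# Hypothetical helper: LOCF for the first half of each gap, NOCB for the second
fill_half_locf_nocb <- function(x) {
  obs <- which(!is.na(x))            # positions of observed values (sorted)
  if (length(obs) == 0L) return(x)   # nothing to carry in either direction
  idx <- seq_along(x)

  # Position of the last observation at or before each index (NA if none)
  prev <- c(NA_integer_, obs)[findInterval(idx, obs) + 1L]
  # Position of the next observation at or after each index (NA if none)
  nxt <- c(obs, NA_integer_)[findInterval(idx - 1L, obs) + 1L]

  # Use the previous observation when it is at least as close as the next one
  use_prev <- !is.na(prev) & (is.na(nxt) | (idx - prev) <= (nxt - idx))
  src <- ifelse(use_prev, prev, nxt)
  x[src]
}

x <- c(NA, 5, NA, NA, NA, 9, NA, NA, 2, NA)
filled <- fill_half_locf_nocb(x)
print(filled)  # 5 5 5 5 9 9 9 2 2 2
```

Because the result is built by indexing into x, the same sketch should also work on character or factor columns such as the MEID.* variables.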
Another idea would be to use the more advanced functions in these packages, which make use of both sides of an NA gap.
One example is imputeTS::na.ma(), which imputes the values with a moving average (you can set the window size).
There are also even more advanced functions, such as:
- imputeTS::na.kalman()
- imputeTS::na.interpolation()
- forecast::na.interp()
- zoo::na.StructTS()
Several of these can also take into account seasonal behavior (e.g. weekday patterns), trend, and other things. The problem with them is, of course, that their results are not as easy to reason about as those of simple algorithms like LOCF or a moving average.
To 2): There is no ready-made function for this either. It would also have to be coded by hand.
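One possible way to code this by hand, sketched in base R under some assumptions: a new segment is assumed to start on every date where Reason.For.Change is >0, and the gap filling is applied within each segment separately, so nothing is carried across a break. The helper fill_gap_halves and the toy data frame are invented for illustration; only the column names come from the question.

```r
# Hypothetical helper: within one segment, fill each NA run with LOCF for the
# first ceiling(len / 2) positions and NOCB for the remaining positions
fill_gap_halves <- function(x) {
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  for (k in which(runs$values)) {
    s <- starts[k]; e <- ends[k]; len <- runs$lengths[k]
    has_prev <- s > 1L           # is there an observation before the gap?
    has_next <- e < length(x)    # is there an observation after the gap?
    if (!has_prev && !has_next) next   # whole segment missing: leave as NA
    n_locf <- if (!has_prev) 0 else if (!has_next) len else ceiling(len / 2)
    if (n_locf > 0) x[s:(s + n_locf - 1)] <- x[s - 1L]
    if (n_locf < len) x[(s + n_locf):e] <- x[e + 1L]
  }
  x
}

# Toy data: a break (Reason.For.Change > 0) occurs on the 6th date
df <- data.frame(
  Date = as.character(as.Date("2014-08-13") + 0:9),
  DAYT = c(NA, 1, NA, NA, NA, NA, 8, NA, NA, 3),
  Reason.For.Change = c("0", "0", "0", "0", "0", "2", "0", "0", "0", "0"),
  stringsAsFactors = FALSE
)

# Each value > 0 starts a new segment; fill within segments only
seg <- cumsum(as.numeric(df$Reason.For.Change) > 0)
df$DAYT_filled <- ave(df$DAYT, seg, FUN = fill_gap_halves)
print(df$DAYT_filled)  # 1 1 1 1 1 8 8 8 3 3
```

In this toy case the days just before the break are filled by LOCF only, and the break date itself is filled by NOCB only, which is the constraint described in the question. Note that ave() is used here on a numeric column; for factor columns the segments might have to be processed with split() and unsplit() instead.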
Source: https://stackoverflow.com/questions/33808890/fill-in-time-series-gaps-with-both-lcof-and-nocb-methods-but-acknowledge-breaks