问题
I have the following data frame (simplified) with the country variable as a factor and the value variable has missing values:
country value
AUT NA
AUT 5
AUT NA
AUT NA
GER NA
GER NA
GER 7
GER NA
GER NA
The following generates the above data frame:
data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"), value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))
Now, I would like to replace the NA values in each country subset using the method last observation carried forward (LOCF). I know the command na.locf
in the zoo package. data <- na.locf(data)
would give me the following data frame:
country value
AUT NA
AUT 5
AUT 5
AUT 5
GER 5
GER 5
GER 7
GER 7
GER 7
However, the function should only be used on the individual subsets split by the country. The following is the output I would need:
country value
AUT NA
AUT 5
AUT 5
AUT 5
GER NA
GER NA
GER 7
GER 7
GER 7
I can't think of an easy way to implement it. Before starting with for-loops, I was wondering if anyone has any idea as to how to solve this.
Many thanks!!
回答1:
Here's a ddply
solution. Try this
library(plyr)
ddply(DF, .(country), na.locf)
country value
1 AUT <NA>
2 AUT 5
3 AUT 5
4 AUT 5
5 GER <NA>
6 GER <NA>
7 GER 7
8 GER 7
9 GER 7
Edit
From ddply
help you can find that
.variables: variables to split data frame by,
as quoted variables, a formula or character vector.
so another alternatives to get what you want are:
ddply(DF, "country", na.locf)
ddply(DF, ~country, na.locf)
note that replacing .variables
with DF$variable
is not allowed, that's why you got an error when doing this.
DF
is your data.frame
回答2:
A modern version of the ddply
solution is to use the package dplyr
:
library(dplyr)
DF %>%
group_by(county) %>%
mutate(value = na.locf(value, na.rm = F))
回答3:
The tidyverse way, albeit not using locf, is:
library(tidyverse)
data %>%
group_by(country) %>%
fill(value)
Source: local data frame [9 x 2]
Groups: country [2]
country value
(fctr) (dbl)
1 AUT NA
2 AUT 5
3 AUT 5
4 AUT 5
5 GER NA
6 GER NA
7 GER 7
8 GER 7
9 GER 7
回答4:
Split the data.frame
with by
and use na.locf
on the subsets:
do.call(rbind,by(data,data$country,na.locf))
If you would like to remove the row names:
do.call(rbind,unname(by(data,data$country,na.locf)))
回答5:
You simply need to split by country, then a do either a zoo::na.locf() or na.fill, filling to the right. Here is an example explicitly showing the three-component arg syntax of na.fill:
library(plyr)
library(zoo)
data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"), value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))
# The following is equivalent to na.locf
na.fill.right <- function(...) { na.fill(..., list(left=NA,interior=NA,right="extend")) }
ddply(data, .(country), na.fill.right)
country value
1 AUT <NA>
2 AUT 5
3 AUT 5
4 AUT 5
5 GER <NA>
6 GER <NA>
7 GER 7
8 GER 7
9 GER 7
回答6:
If speed is a consideration then this unstack
/stack
solution is about 4 to 6 times faster than the others on my system although it does entail a slightly longer line of code:
stack(lapply(unstack(data, value ~ country), na.locf, na.rm = FALSE))
Another approach is:
transform(data, value = ave(value, country, FUN = na.locf0))
回答7:
I'm a little late to this conversation, but here is a data.table
way, which will be much faster for larger data sets:
library(zoo)
library(data.table)
# Convert to data table
setDT(data)
data[, value := na.locf(value, na.rm = FALSE), by = country]
data
country value
1: AUT NA
2: AUT 5
3: AUT 5
4: AUT 5
5: GER NA
6: GER NA
7: GER 7
8: GER 7
9: GER 7
# And if you want to convert "data" back to a data frame...
setDF(data)
回答8:
A combination of the packages dplyr and imputeTS can do the job.
library(dplyr)
library(imputeTS)
data %>% group_by(country) %>%
mutate(value = na.locf(value, na.remaining="keep"))
With the na.remaining parameter of the na.locf function of imputeTS you have additionally the option to choose, what to do with the trailing NAs.
These are the options:
- "keep" - return the series with NAs
- "rm" - remove remaining NAs
- "mean" - replace remaining NAs by overall mean
- "rev" - perform nocb / locf from the reverse direction
By choosing "mean" you would for example get a result with 7 for every GER in the specific example.
来源:https://stackoverflow.com/questions/13616965/how-to-fill-nas-with-locf-by-factors-in-data-frame-split-by-country