Question
I have a sample table in which some, but not all, of the NA values need to be replaced.
> dat
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 <NA> 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 <NA> 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 <NA> 2
15 3 <NA> 3
16 3 bar 4
17 3 <NA> 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
My objective is to replace, within each id, the NA values that are surrounded by the same "message", that is, the NA values lying between the first appearance of a message (its smallest index value) and its last appearance (its largest index value).
Sometimes the NA sequences have length 1; other times they can be very long. Regardless, every NA that is "sandwiched" between the same value of "message" before and after it should be filled in.
The output for the above incomplete table would be:
> output
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
Any guidance using data.table or dplyr here would be helpful, as I'm not even sure where to begin.
The furthest I got was subsetting by unique messages, but this method does not take id into account:
#get distinct messages
messages = unique(dat$message)
#remove NA
messages = messages[!is.na(messages)]
#subset dat for each message
for (i in 1:length(messages)) {print(dat[dat$message == messages[i],]) }
The data:
dput(dat)
structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo",
NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar",
NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")
Answer 1:
Perform an na.locf0 both forwards and backwards; where the two results agree, use the common value, otherwise use NA. The grouping is done with ave.
library(zoo)
filler <- function(x) {
forward <- na.locf0(x)
backward <- na.locf0(x, fromLast = TRUE)
ifelse(forward == backward, forward, NA)
}
transform(dat, message = ave(message, id, FUN = filler))
giving:
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
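For readers who want to see what the forward and backward passes do without loading zoo, here is a minimal base R sketch of the same idea. The helper names locf and fill_sandwiched are hypothetical (they are not part of zoo or of the answer above), and the sketch assumes message is a character vector.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only).
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # slot 1 is NA, used for leading gaps
}

# Fill an NA only when the nearest non-NA values on both sides agree.
fill_sandwiched <- function(x) {
  forward  <- locf(x)
  backward <- rev(locf(rev(x)))
  agree <- !is.na(forward) & !is.na(backward) & forward == backward
  ifelse(agree, forward, NA_character_)
}

ids <- c(1, 1, 1, 2, 2, 2)
msg <- c("foo", NA, "foo", "foo", NA, "baz")

# ave() applies the filler separately within each id
ave(msg, ids, FUN = fill_sandwiched)
# [1] "foo" "foo" "foo" "foo" NA    "baz"
```

The NA in the second group stays missing because its neighbours ("foo" before, "baz" after) differ.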
Answer 2:
An option that uses na.approx from zoo.
First, we extract the unique elements of column message that are not NA and look up, for each entry of dat$message, its position in that vector:
x <- unique(na.omit(dat$message))
(y <- match(dat$message, x))
# [1] NA 1 1 NA 1 NA NA 2 NA 2 2 2 3 NA NA 3 NA 3 NA 4
library(zoo)
library(dplyr)
out <- do.call(coalesce,
lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE))))
dat$new <- x[out]
dat
# id message index new
#1 1 <NA> 1 <NA>
#2 1 foo 2 foo
#3 1 foo 3 foo
#4 1 <NA> 4 foo
#5 1 foo 5 foo
#6 1 <NA> 6 <NA>
#7 2 <NA> 1 <NA>
#8 2 baz 2 baz
#9 2 <NA> 3 baz
#10 2 baz 4 baz
#11 2 baz 5 baz
#12 2 baz 6 baz
#13 3 bar 1 bar
#14 3 <NA> 2 bar
#15 3 <NA> 3 bar
#16 3 bar 4 bar
#17 3 <NA> 5 bar
#18 3 bar 6 bar
#19 3 <NA> 7 <NA>
#20 3 qux 8 qux
Explanation
When we call
match(y, 1) * 1
# [1] NA 1 1 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
we keep only the positions where y equals 1. Likewise, when we do
match(y, 2) * 2
# [1] NA NA NA NA NA NA NA 2 NA 2 2 2 NA NA NA NA NA NA NA NA
we get the same for the 2s.
Think of 1 and 2 as the first and second elements of
x
# [1] "foo" "baz" "bar" "qux"
that is, "foo" and "baz".
Now, for each match(y, i) * i, we can call na.approx from zoo to fill the NAs that lie in between (i will later run over seq_along(x)).
na.approx(match(y, 2) * 2, na.rm = FALSE)
# [1] NA NA NA NA NA NA NA 2 2 2 2 2 NA NA NA NA NA NA NA NA
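The reason interpolation works as a "fill" here is that each masked vector contains a single distinct value, so interpolating linearly between two equal endpoints returns that same value. A small base R sketch of this, using stats::approx directly rather than na.approx (the vector v is made up for illustration):

```r
v <- c(NA, 2, NA, 2, 2, NA)   # illustrative masked vector
known <- which(!is.na(v))     # positions that hold a value

# Linear interpolation at every position; with the default rule = 1,
# positions outside the range of known points stay NA.
filled <- approx(x = known, y = v[known], xout = seq_along(v))$y
filled
# [1] NA  2  2  2  2 NA
```

The leading and trailing NAs are untouched because they are not enclosed by known values, which mirrors the na.rm = FALSE behaviour shown above.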
We do the same for each element of seq_along(x), that is 1:4, using lapply. The result is a list:
lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE)))
#[[1]]
# [1] NA 1 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#
#[[2]]
# [1] NA NA NA NA NA NA NA 2 2 2 2 2 NA NA NA NA NA NA NA NA
#
#[[3]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA 3 3 3 3 3 3 NA NA
#
#[[4]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4
(as.double was needed here because otherwise coalesce would complain that "Argument 4 must be type double, not integer".)
We are almost there. Next we need the first non-missing value at each position; this is where coalesce from dplyr comes into play, and the result is
out <- do.call(coalesce,
lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE))))
out
# [1] NA 1 1 1 1 NA NA 2 2 2 2 2 3 3 3 3 3 3 NA 4
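To make the "first non-missing value at each position" step concrete, here is a rough base R equivalent. The function coalesce_base is a hypothetical stand-in written for illustration; it is not dplyr's implementation and skips dplyr's type checks.

```r
# Hypothetical stand-in for dplyr::coalesce: walk through the vectors
# left to right and keep the first non-NA value at each position.
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

a <- c(NA, 1, 1, NA, NA)
b <- c(NA, NA, NA, 2, NA)

coalesce_base(a, b)
# [1] NA  1  1  2 NA
```

Positions that are NA in every input remain NA, which is what leaves the unsandwiched gaps unfilled in out.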
We can use this vector to extract the desired values from x:
x[out]
# [1] NA "foo" "foo" "foo" "foo" NA NA "baz" "baz" "baz" "baz" "baz" "bar" "bar" "bar" "bar" "bar" "bar" NA "qux"
Hope this helps.
Answer 3:
Here's an approach that fills the values without grouping and then sets them back to NA where they were filled incorrectly.
tidyr::fill by default fills missing values with the previous value, so it will overfill some values. Because it is called on ungrouped data here, it also fills across id boundaries, so we use an if_else condition to fix its errors.
First, we capture the original missing-value locations and calculate the min and max index for each id and message. After filling, we join on these index boundaries. If there is no match, the id changed; if there is a match, either the replacement was correct or the index lies outside the boundaries. So, at the locations that were originally missing, we check these conditions and set the value back to NA if either is met.
EDIT: this can be broken on other input, attempting to fix
library(tidyverse)
dat <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")
indices <- dat %>%
group_by(id, message) %>%
summarise(min = min(index), max = max(index)) %>%
drop_na
dat %>%
mutate(orig_na = is.na(message)) %>%
fill(message) %>%
left_join(indices, by = c("id", "message")) %>%
mutate(
message = if_else(
condition = orig_na &
(index < min | index > max | is.na(min)),
true = NA_character_,
false = message
)
)
#> id message index orig_na min max
#> 1 1 <NA> 1 TRUE NA NA
#> 2 1 foo 2 FALSE 2 5
#> 3 1 foo 3 FALSE 2 5
#> 4 1 foo 4 TRUE 2 5
#> 5 1 foo 5 FALSE 2 5
#> 6 1 <NA> 6 TRUE 2 5
#> 7 2 <NA> 1 TRUE NA NA
#> 8 2 baz 2 FALSE 2 6
#> 9 2 baz 3 TRUE 2 6
#> 10 2 baz 4 FALSE 2 6
#> 11 2 baz 5 FALSE 2 6
#> 12 2 baz 6 FALSE 2 6
#> 13 3 bar 1 FALSE 1 6
#> 14 3 bar 2 TRUE 1 6
#> 15 3 bar 3 TRUE 1 6
#> 16 3 bar 4 FALSE 1 6
#> 17 3 bar 5 TRUE 1 6
#> 18 3 bar 6 FALSE 1 6
#> 19 3 <NA> 7 TRUE 1 6
#> 20 3 qux 8 FALSE 8 8
Created on 2019-02-15 by the reprex package (v0.2.1)
Answer 4:
If you fill both ways and check for equality that should work, as long as you account for grouping and index:
tidyverse:
library(tidyverse)
dat %>%
arrange(id, index) %>%
mutate(msg_down = fill(group_by(., id), message, .direction = 'down')$message,
msg_up = fill(group_by(., id), message, .direction = 'up')$message,
message = case_when(!is.na(message) ~ message,
msg_down == msg_up ~ msg_down,
TRUE ~ NA_character_)) %>%
select(-msg_down, -msg_up)
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
data.table:
library(data.table)
library(zoo)
setDT(dat)[order(index),
message := ifelse(na.locf(message, na.rm = FALSE) == na.locf(message, na.rm = FALSE, fromLast = TRUE),
na.locf(message, na.rm = FALSE),
NA),
by = "id"][]
id message index
1: 1 <NA> 1
2: 1 foo 2
3: 1 foo 3
4: 1 foo 4
5: 1 foo 5
6: 1 <NA> 6
7: 2 <NA> 1
8: 2 baz 2
9: 2 baz 3
10: 2 baz 4
11: 2 baz 5
12: 2 baz 6
13: 3 bar 1
14: 3 bar 2
15: 3 bar 3
16: 3 bar 4
17: 3 bar 5
18: 3 bar 6
19: 3 <NA> 7
20: 3 qux 8
Answer 5:
Another tidyverse solution using case_when. Edited to avoid filling after end of series.
library(dplyr)
dfr <- data.frame(
index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"),
id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
)
dfrFilled <- dfr %>%
group_by(id) %>%
mutate(
endSeries = max( # identify end of series
index[message == na.omit(message)[1]],
na.rm = TRUE
),
filledValues = case_when(
min(index) == index ~ message,
max(index) == index ~ message,
index < endSeries ~ na.omit(message)[1], # fill if index is before end of series.
TRUE ~ message
)
)
Source: https://stackoverflow.com/questions/54717876/replace-na-when-last-and-next-non-na-values-are-equal