问题
This is a sample of my data set:
day city count
1 1 A 50
2 2 A 100
3 2 B 110
4 2 C 90
Here is the code for reproducing it:
df <- data.frame(
day = c(1,2,2,2),
city = c("A","A","B","C"),
count = c(50,100,110,90)
)
As you could see, the count data is missing for city B and C on the day 1. What I want to do is to use city A's count as an estimate for the other two cities. So the desired output would be:
day city count
1 1 A 50
2 1 B 50
3 1 C 50
4 2 A 100
5 2 B 110
6 2 C 90
I could come up with a for loop to do it, but I feel like there should be an easier way of doing it. My idea is to count the number of observations for each day, and then for the days that the number of observations is less than the number of cities in the data set, I would replicate the row to complete the data for that day. Any better ideas? or a more efficient for-loop? Thanks.
回答1:
With dplyr
and tidyr
, we can do:
library(dplyr)
library(tidyr)
df %>%
expand(day, city) %>%
left_join(df) %>%
group_by(day) %>%
fill(count, .direction = "up") %>%
fill(count, .direction = "down")
Alternatively, we can avoid the left_join
using thelatemail's solution:
df %>%
complete(day, city) %>%
group_by(day) %>%
fill(count, .direction = "up") %>%
fill(count, .direction = "down")
Both return:
# A tibble: 6 x 3
day city count
<dbl> <fct> <dbl>
1 1. A 50.
2 1. B 50.
3 1. C 50.
4 2. A 100.
5 2. B 110.
6 2. C 90.
Data (slightly modified to show .direction
filling both directions):
df <- data.frame(
day = c(1,2,2,2),
city = c("B","A","B","C"),
count = c(50,100,110,90)
)
来源:https://stackoverflow.com/questions/49184893/conditionally-duplicating-rows-in-a-data-frame