问题
I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have a 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:
timestamp tr tt sr st
1 9/1/01 0:00 1.018269e+02 -312.8622 -1959.393 4959.828
2 9/1/01 0:01 1.023567e+02 -313.0002 -1957.755 4958.935
3 9/1/01 0:02 1.018857e+02 -313.9406 -1956.799 4959.938
4 9/1/01 0:03 1.025463e+02 -310.9261 -1957.347 4961.095
5 9/1/01 0:04 1.010228e+02 -311.5469 -1957.786 4959.078
The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.
I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.
I\'m honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.
Thanks!
回答1:
I think the easiest thing ist to set Date first as already described, convert to zoo, and then just set a merge:
df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")
df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index
df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)
Start and end are given from your df1 (original data) and you are setting by - e.g min - as you need for your example. all=TRUE sets all missing values at the missing dates to NAs.
回答2:
This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.
library(dplyr)
ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00",'%m/%d/%y %H:%M'), as.POSIXct("2001-09-01 0:07",'%m/%d/%y %H:%M'), by="min")
ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by="min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')
df <- data.frame(timestamp=ts)
data_with_missing_times <- full_join(df,original_data)
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA
Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.
data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 0 0 0 0
7 09/01/01 00:06 0 0 0 0
8 09/01/01 00:07 0 0 0 0
回答3:
Date padding is implemented in the padr
package in R. If you store your data frame, with your date-time variable stored as POSIXct
or POSIXlt
. All you need to do is:
library(padr)
pad(df_name)
See vignette("padr") or this blog post for its working.
回答4:
# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
tr = rnorm(4,0,1),
tt = rnorm(4,0,1))
originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")
# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)
# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")
# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")
回答5:
I think this can accomplished by using complete
in tidyr
package.
library(tidyverse)
df <- df %>%
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "minute"),
tr, tt, sr,st)
you can also initialize your start date and end date instead of using min(timestamp)
and max(timestamp)
.
回答6:
In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:
df[is.na(df)] <- 0
(I orginally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted as an answer)
回答7:
df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"
full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # conver zoo object to data frame
full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge
回答8:
I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate
:
date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
date <- date %m+% months(1)
date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)
This will give me a list of dates in incremental months. Then I join
df_with_missing_months <- full_join(df_1,df)
回答9:
There are some advances in handling time series data in R, e.g. the tsibble
package added such time series manipulations in tidy way:
library(tsibble)
library(lubridate)
ts <- lubridate::dmy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
tr = rnorm(4,0,1),
tt = rnorm(4,0,1),
index = timestamp)
originaldf %>%
fill_gaps()
来源:https://stackoverflow.com/questions/16787038/insert-rows-for-missing-dates-times