Question
I want to forecast day-ahead power consumption using recurrent neural networks (RNN), but I find the required data format (samples, timesteps, features) confusing. Let me explain with an example.
I have power_dataset.csv on Dropbox, which contains power consumption from 5 June to 18 June at a 10-minute rate (144 observations per day). To check the performance of an RNN using the rnn R package, I am following these steps:
- train model M for the usage of 17 June by using data from 5-16 June
- predict usage of 18 June by using M and updated usage from 6-17 June
My understanding of the RNN data format is:
Samples: the number of samples or observations.
Timesteps: the number of steps after which the pattern repeats. In my case, 144 observations occur in a day, so every 144 consecutive observations constitute the timesteps. In other words, it defines the seasonality period.
Features: the number of features, which is one in my case, i.e., the consumption time series of the historical days.
Accordingly, my script is:
library(rnn)
df <- read.csv("power_dataset.csv")
train <- df[1:2016,] # train set from 5-16 June
test <- df[145:dim(df)[1],] # test set from 6-18 June
# prepare data to train a model
trainX <- train[1:1872,]$power # using only power column now
trainY <- train[1873:dim(train)[1],]$power
# data formatting acc. to rnn as [samples, timesteps, features]
tx <- array(trainX,dim=c(NROW(trainX),144,1))
ty <- array(trainY,dim=c(NROW(trainY),144,1))
model <- trainr(X=tx,Y=ty,learningrate = 0.04, hidden_dim = 10, numepochs = 100)
The error output is:
The sample dimension of X is different from the sample dimension of Y.
The error is caused by incorrect data formatting. How can I format the data correctly?
Answer 1:
A few points:
- You need the same number of samples in the input X and the output Y in the training data to start with. In the above implementation you have 1872 samples for X and 144 samples for Y. Moreover, your training array tx contains the same column replicated 144 times, which does not make much sense.
- We can think of training an RNN or LSTM model in a couple of ways. In the figure below, Model1 tries to capture recurring patterns across the 10-minute time intervals, whereas Model2 tries to capture the recurring pattern across the (previous) days.
# Model1: samples = days, timesteps = the 144 ten-minute slots of a day
window <- 144
train <- df[1:(13*window),]$power
# one row per day (13 x 144)
tx <- t(sapply(1:13, function(x) train[((x-1)*window+1):(x*window)]))
ty <- tx[2:13,]       # targets: days 2-13
tx <- tx[-nrow(tx),]  # inputs: days 1-12
tx <- array(tx,dim=c(NROW(tx),NCOL(tx),1))
ty <- array(ty,dim=c(NROW(ty),NCOL(ty),1))
model <- trainr(X=tx,Y=ty,learningrate = 0.01, hidden_dim = 10, numepochs = 100)
# transpose so that test has the same [days, slots] layout as tx
test <- t(sapply(2:13, function(x) train[((x-1)*window+1):(x*window)]))
pred <- predictr(model,X=array(test,dim=c(NROW(test),NCOL(test),1)))
# Model2: samples = the 144 ten-minute slots, timesteps = the 12 previous days
window <- 144
train <- df[1:(13*window),]$power
# one column per day (144 x 12)
tx <- sapply(1:12, function(x) train[((x-1)*window+1):(x*window)])
ty <- train[(12*window+1):(13*window)]  # target: day 13
tx <- array(tx,dim=c(NROW(tx),NCOL(tx),1))
ty <- array(ty,dim=c(NROW(ty),1,1))     # one target value per sample
model <- trainr(X=tx,Y=ty,learningrate = 0.01, hidden_dim = 10, numepochs = 100, seq_to_seq_unsync=TRUE)
test <- sapply(2:13, function(x) train[((x-1)*window+1):(x*window)])
pred <- predictr(model,X=array(test,dim=c(NROW(test),NCOL(test),1)))
- Your data set is too small, relative to the feature size, to train an RNN or an LSTM. That is why both of the trained models are very poor and unusable. You can try to collect more data, retrain the models, and then use them for prediction.
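The two layouts can be checked without the rnn package. The following is a minimal base-R sketch, assuming a synthetic series in place of power_dataset.csv (the `power` vector and the `days` matrix are made up for illustration). It reproduces the recycling problem from the question and builds the [samples, timesteps, features] arrays that each model expects.

```r
window <- 144   # 10-minute readings per day
n_days <- 13
set.seed(1)
# Synthetic stand-in for df$power (assumption: the real data is not used here)
power <- runif(n_days * window)

# The question's formatting: array() recycles the 1872 values to fill
# 1872 x 144 slots, so every "timestep" column is a copy of the same vector.
trainX <- power[1:1872]
bad_tx <- array(trainX, dim = c(length(trainX), window, 1))
recycled <- all(bad_tx[, 1, 1] == bad_tx[, 2, 1])

# One row per day, one column per 10-minute slot
days <- matrix(power, nrow = n_days, ncol = window, byrow = TRUE)

# Model1 layout: samples = days, timesteps = slots within a day
tx1 <- array(days[1:12, ], dim = c(12, window, 1))
ty1 <- array(days[2:13, ], dim = c(12, window, 1))

# Model2 layout: samples = slots, timesteps = previous days,
# with a single target value (day 13) per sample
tx2 <- array(t(days[1:12, ]), dim = c(window, 12, 1))
ty2 <- array(days[13, ], dim = c(window, 1, 1))
```

In both layouts the first dimension of the input and target arrays agrees (12 and 12, or 144 and 144), which is the condition the error message in the question complains about.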
Answer 2:
It suffices to set seq_to_seq_unsync=TRUE. Hope this helps.
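As a rough illustration of what that flag can and cannot relax, here is a small sketch of a shape check. `check_rnn_shapes` is a hypothetical helper, not part of the rnn package, and it rests on an assumption: that the sample dimension of X and Y must always match (as the error message in the question suggests), while the timestep dimension may differ only when seq_to_seq_unsync is TRUE.

```r
# Hypothetical helper (not part of the rnn package): checks two
# [samples, timesteps, features] arrays under the assumption stated above.
check_rnn_shapes <- function(X, Y, seq_to_seq_unsync = FALSE) {
  stopifnot(length(dim(X)) == 3, length(dim(Y)) == 3)
  if (dim(X)[1] != dim(Y)[1]) {
    return("sample dimension mismatch")
  }
  if (!seq_to_seq_unsync && dim(X)[2] != dim(Y)[2]) {
    return("time dimension mismatch")
  }
  "ok"
}

# Many-to-one shapes: 144 samples, 12 timesteps in, 1 timestep out
X <- array(0, dim = c(144, 12, 1))
Y <- array(0, dim = c(144, 1, 1))
res_sync   <- check_rnn_shapes(X, Y)
res_unsync <- check_rnn_shapes(X, Y, seq_to_seq_unsync = TRUE)

# Shapes from the question: the sample dimensions differ, so the flag alone does not help
res_question <- check_rnn_shapes(array(0, dim = c(1872, 144, 1)),
                                 array(0, dim = c(144, 144, 1)),
                                 seq_to_seq_unsync = TRUE)
```

Under that assumption, the arrays from the question would still need reshaping so that X and Y have the same number of samples before the flag makes a difference.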
Source: https://stackoverflow.com/questions/42431720/format-time-series-data-for-short-term-forecasting-using-recurrent-neural-networ