Question
I'm working on a sequence forecasting problem and I don't have much experience in this area, so some of the below questions might be naive.
FYI: I've created a follow-up question with a focus on CRFs here
I have the following problem:
I would like to forecast a binary sequence for multiple, non-independent variables.
Inputs:
I have a dataset with the following variables:
- Timestamps
- Groups A and B
- Binary signal corresponding to each group at a particular timestamp
Additionally, suppose the following:
- We can extract additional attributes from the timestamps (e.g. hour of day) which can be used as external predictors
- We believe that groups A and B are not independent; therefore it might be optimal to model their behaviour jointly
binary_signal_group_A and binary_signal_group_B are the 2 non-independent variables that I would like to forecast using (1) their past behaviour and (2) additional information extracted from each timestamp.
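For illustration, here is a minimal sketch (my own, not part of the original setup) of how such timestamp-derived predictors could be built with pandas; the 'timestamp' column and the date range are made up:
import pandas as pd

# hypothetical raw data with a timestamp column
raw = pd.DataFrame({'timestamp': pd.date_range('2019-01-01', periods=6, freq='H')})

# derive external predictors from the timestamp
raw['hour_of_day'] = raw['timestamp'].dt.hour
raw['is_weekend'] = (raw['timestamp'].dt.dayofweek >= 5).astype(int)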
What I've done so far:
# required libraries
import re
import numpy as np
import pandas as pd
from keras import Sequential
from keras.layers import LSTM
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame  # create a sample dataframe
      .from_records(np.random.randint(2, size=[data_length, 3]))
      .rename(columns={0: 'a', 1: 'b', 2: 'extra'}))
# NOTE: the 'extra' variable refers to a generic predictor such as for example 'is_weekend' indicator, it doesn't really matter what it is
# shift so that our sequences are in rows (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last'  # the 'next' variables are what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$', '', c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
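# NOTE: with the toy settings above (data_length=18, shift_length=3), predictor_array ends up with
# shape (12, 3, 3) = (samples, timesteps, predictor channels) and response_array with shape (12, 3, 2)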
# feed into the model
model = Sequential()
model.add(LSTM(8, input_shape=(predictor_array.shape[1],predictor_array.shape[2]), return_sequences=True)) # the number of neurons here can be anything
model.add(LSTM(2, return_sequences=True)) # should I use an activation function here? the number of neurons here must be equal to the # of groups we are predicting
model.summary()
# _________________________________________________________________
# Layer (type) Output Shape Param #
# =================================================================
# lstm_62 (LSTM) (None, 3, 8) 384
# _________________________________________________________________
# lstm_63 (LSTM) (None, 3, 2) 88
# =================================================================
# Total params: 472
# Trainable params: 472
# Non-trainable params: 0
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # is it valid to use crossentropy and accuracy as metric?
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict_classes(predictor_array) # not gonna worry about train/test split here
model_preds.shape # should return (12, 3, 2) or (# of records, # of timestamps, # of groups which are a and b)
# (12, 3)
model_preds
# array([[1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0]])
Questions:
The main question here is this: how do I get this working so that the model would forecast the next N sequences for both groups?
Additionally, I would like to ask the following questions:
- Groups A and B are expected to be cross-correlated, however, is it valid to attempt to output both A and B sequences by a single model or should I fit 2 separate models, one predicting A, the other one predicting B but both using historical A and B data as inputs?
- While my last layer in the model is an LSTM of shape (None, 3, 2), the prediction output is of shape (12, 3) when I would have expected it to be (12, 2) -- am I doing something wrong here and if so, how would I fix this?
- As far as the output LSTM layer is concerned, would it be a good idea to use an activation function here, such as sigmoid? Why/why not?
- Is it valid to use a classification type loss (binary cross-entropy) and metrics (accuracy) for optimising a sequence?
- Is an LSTM model an optimal choice here? Does anyone think that a CRF or some HMM-type model would work better here?
Many thanks!
Answer 1:
I will answer all questions sequentially.
how do I get this working so that the model would forecast the next N sequences for both groups?
I would suggest two modifications to your model.
The first is using sigmoid activation for the last layer.
Why? Consider the binary cross entropy loss function (I borrowed the equation from here):
L = -(y * log(p) + (1 - y) * log(1 - p))
where L is the calculated loss, p is the network's prediction and y is the target value.
The loss is only defined for p in the open interval (0, 1); if p falls outside this range, the loss is undefined. The default activation of an LSTM layer in Keras is tanh, whose output range is (-1, 1). This implies that the output of the model is not suitable for the binary cross entropy loss, and if you try to train the model you might end up getting nan for the loss.
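A quick toy check of this point (my own illustration, not from the original answer): if the prediction comes from a tanh output it can be negative, and the log terms in the loss then produce nan.
import numpy as np

y = 1.0   # target value
p = -0.5  # a value tanh can produce but sigmoid cannot

# binary cross entropy: L = -(y*log(p) + (1-y)*log(1-p))
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)  # nan, because log is undefined for p <= 0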
The second modification (which goes together with the first) is to add a sigmoid activation before the output. For this you have three options:
- Add a Dense layer with sigmoid activation between your last LSTM layer and the output.
- Or change the activation of the last LSTM layer to sigmoid.
- Or add an Activation layer with sigmoid activation before the output.
Even though all three options would work, I would suggest using a Dense layer with sigmoid activation because it almost always works better. The model with the suggested changes would be:
from keras.layers import Dense, TimeDistributed  # additional layers needed for the suggested model

model = Sequential()
model.add(LSTM(8, input_shape=(predictor_array.shape[1], predictor_array.shape[2]), return_sequences=True))
model.add(LSTM(2, return_sequences=True))
model.add(TimeDistributed(Dense(2, activation="sigmoid")))
model.summary()
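A minimal sketch of how training and prediction might then look with this architecture; the 0.5 threshold for turning the probabilities into binary sequences is my own assumption:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)

# predict returns per-group probabilities of shape (samples, timesteps, groups) = (12, 3, 2)
probs = model.predict(predictor_array)
binary_preds = (probs > 0.5).astype(int)  # binary forecasts for groups a and b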
... is it valid to attempt to output both A and B sequences by a single model or should I fit 2 separate models ... ?
Ideally both approaches could work. But recent studies such as this one show that the former case (where you use a single model for both groups) tends to perform better. The approach is generally called Multi-Task Learning. The idea behind Multi-Task Learning is very broad, but for simplicity it can be thought of as adding an inductive bias by forcing the model to learn hidden representations that are common to multiple tasks.
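For illustration, one way the multi-task idea could be made explicit is with the Keras functional API: a shared LSTM trunk with one sigmoid head per group. This is a sketch of my own, not part of the original answer; the shapes follow the toy data from the question (3 timesteps, 3 predictor channels).
from keras.layers import Input, LSTM, Dense, TimeDistributed
from keras.models import Model

inputs = Input(shape=(3, 3))                     # (timesteps, predictor channels)
shared = LSTM(8, return_sequences=True)(inputs)  # hidden representation shared by both tasks

# one sigmoid head per group, both reading the shared representation
out_a = TimeDistributed(Dense(1, activation='sigmoid'), name='group_a')(shared)
out_b = TimeDistributed(Dense(1, activation='sigmoid'), name='group_b')(shared)

mtl_model = Model(inputs=inputs, outputs=[out_a, out_b])
mtl_model.compile(loss='binary_crossentropy', optimizer='adam')

# targets: split response_array (12, 3, 2) into one (12, 3, 1) target per group
mtl_model.fit(predictor_array, [response_array[..., :1], response_array[..., 1:]], epochs=10, batch_size=1)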
... the prediction output is of shape (12, 3) when I would have expected it to be (12, 2) -- am I doing something wrong here ... ?
You are getting this because you are using the predict_classes method. Unlike the predict method, predict_classes returns the index of the maximum along the channels axis (in your case the third axis). As explained above, if you use a sigmoid activation for the last layer and replace predict_classes with predict, you will get what you are expecting.
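A toy illustration of that shape behaviour (assuming a (12, 3, 2) array of sigmoid outputs):
import numpy as np

probs = np.random.rand(12, 3, 2)         # hypothetical (samples, timesteps, groups) probabilities
print(np.argmax(probs, axis=-1).shape)   # (12, 3)  -- what predict_classes effectively returns
print((probs > 0.5).astype(int).shape)   # (12, 3, 2) -- thresholding predict keeps both groups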
As far as the output LSTM layer is concerned, would it be a good idea to use an activation function here, such as sigmoid? Why/why not?
I hope I've explained this above. The answer is YES.
Is it valid to use a classification type loss (binary cross-entropy) and metrics (accuracy) for optimising a sequence?
Since your targets are binary signals (the distribution is a Bernoulli distribution), yes, it is valid to use binary cross-entropy loss and accuracy as a metric. This answer gives more details on why binary cross-entropy is valid for this type of target variable.
Is an LSTM model an optimal choice here? Does anyone think that a CRF or some HMM-type model would work better here?
This depends on the amount of data available and the complexity of the model you choose. CRF and HMM models are simpler and work better when the available data is small. But if the dataset is large, an LSTM will almost always outperform both CRF and HMM. My suggestion: if you have a lot of data, use an LSTM; if you have little data or are looking for a simpler model, use a CRF or HMM.
Source: https://stackoverflow.com/questions/53977695/multivariate-binary-sequence-prediction-with-lstm