Merge or append multiple Keras TimeseriesGenerator objects into one

Question

I'm trying to make an LSTM model. The data comes from a CSV file that contains values for multiple stocks.

I can't use all the rows as they appear in the file to make sequences because each sequence is only relevant in the context of its own stock, so I need to select the rows for each stock and make the sequences based on that.

I have something like this:

for stock in stocks:

    stock_df = df.loc[(df['symbol'] == stock)].copy()
    target = stock_df.pop('price')

    x = np.array(stock_df.values)
    y = np.array(target.values)

    sequence = TimeseriesGenerator(x, y, length = 4, sampling_rate = 1, batch_size = 1)

That works fine, but then I want to merge each of those sequences into a bigger one that I will use for training and that contains the data for all the stocks.

It is not possible to use append or merge because the function returns a generator object, not a NumPy array.

Answer 1:

EDIT: New answer:

What I ended up doing is all the preprocessing manually, saving one .npy file per stock that contains the preprocessed sequences. I then build batches with a hand-written generator like this:

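import numpy as np
import tensorflow as tf

class seq_generator():

  def __init__(self, list_of_filepaths):
    # Maps each .npy path to the sequence indices already yielded from it
    self.usedDict = dict()
    for path in list_of_filepaths:
      self.usedDict[path] = []

  def generate(self):
    # Note: once every index of every file has been used, this loop
    # stops yielding and spins forever
    while True:
      path = np.random.choice(list(self.usedDict.keys()))
      # Expected shape: (n_sequences, n_timesteps, n_features)
      stock_array = np.load(path)
      random_sequence = np.random.randint(stock_array.shape[0])
      if random_sequence not in self.usedDict[path]:
        self.usedDict[path].append(random_sequence)
        yield stock_array[random_sequence, :, :]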

train_generator = seq_generator(list_of_filepaths)

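# from_generator expects a callable, so pass the bound method without calling it.
# Each yielded item is a single array, so a single dtype and shape are given.
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))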

train_dataset = train_dataset.batch(batch_size)

Where list_of_filepaths is simply a list of paths to preprocessed .npy data.

This will:

1. Load a random stock's preprocessed .npy data
2. Pick a sequence at random
3. Check whether the index of that sequence has already been used in usedDict
4. If not:
   - Append the index of that sequence to usedDict, so that the same data is not fed to the model twice
   - Yield the sequence

This means the generator yields a single unique sequence from a random stock on each "call", which lets me use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
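The preprocessing step itself is not shown in this answer. A minimal sketch of how the per-stock .npy files might be produced, assuming each stock's rows are already a numeric array; the helper name save_stock_sequences, the toy data, and the temporary output directory are all hypothetical:

```python
import os
import tempfile
import numpy as np

def save_stock_sequences(values, length, path):
    """Slice one stock's rows into overlapping windows and save them as .npy."""
    values = np.asarray(values, dtype=np.float32)
    n_sequences = len(values) - length + 1
    if n_sequences <= 0:
        raise ValueError("not enough rows for a single window")
    # Resulting shape: (n_sequences, length, n_features)
    sequences = np.stack([values[i:i + length] for i in range(n_sequences)])
    np.save(path, sequences)
    return sequences.shape

# Hypothetical toy data: two stocks with 6 and 5 rows, 2 features each
toy_stocks = {
    "AAA": np.arange(12).reshape(6, 2),
    "BBB": np.arange(10).reshape(5, 2),
}

out_dir = tempfile.mkdtemp()
list_of_filepaths = []
for symbol, values in toy_stocks.items():
    path = os.path.join(out_dir, symbol + ".npy")
    save_stock_sequences(values, length=4, path=path)
    list_of_filepaths.append(path)
```

Because each file holds windows from one stock only, no sequence can span two stocks. If loading a whole file for every draw turns out to be slow, passing mmap_mode="r" to np.load may help, since it reads only the slices that are actually indexed.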

Original answer:

I think the answer from @TF_Support slightly misses the point. If I understand your question correctly, you don't want to train one model per stock; you want one model trained on the entire dataset.

If you have enough memory, you could create the sequences manually and hold the entire dataset in memory. The issue I'm facing is similar, except that I simply can't hold everything in memory: Creating a TimeseriesGenerator with multiple inputs.
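As a rough sketch of the in-memory option, assuming the data fits in RAM: build the windows for each stock separately and concatenate the results. The helper make_windows and the toy data are hypothetical; the pairing of window x[i-length:i] with target y[i] is meant to mirror TimeseriesGenerator with sampling_rate=1.

```python
import numpy as np

def make_windows(x, y, length):
    """Return (windows, targets) for one stock.

    windows has shape (n_samples, length, n_features); the window covering
    rows i-length .. i-1 is paired with the target at row i.
    """
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    n_samples = len(x) - length
    if n_samples <= 0:
        # Not enough rows to form a single window
        return (np.empty((0, length) + x.shape[1:], dtype=np.float32),
                np.empty((0,), dtype=np.float32))
    windows = np.stack([x[i:i + length] for i in range(n_samples)])
    targets = y[length:]
    return windows, targets

# Hypothetical toy data: (features, price) per stock
per_stock = {
    "AAA": (np.arange(12).reshape(6, 2), np.arange(6)),
    "BBB": (np.arange(10).reshape(5, 2), np.arange(5)),
}

parts = [make_windows(x, y, length=4) for x, y in per_stock.values()]
X_all = np.concatenate([p[0] for p in parts])
y_all = np.concatenate([p[1] for p in parts])
```

X_all and y_all can then be shuffled and passed to model.fit as ordinary arrays; windows never cross a stock boundary because each stock is windowed on its own.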

Instead, I'm exploring the possibility of preprocessing the data for each stock separately, saving it as .npy files, and then using a generator to load a random sample of those .npy files to batch data to the model. I'm not entirely sure how to approach this yet, though.

Answer 2:

For this scenario, you want to merge each of those sequences into a bigger one that you will use for training and that contains the data for all the stocks.

You can append the created TimeseriesGenerator objects to a Python list.

stock_timegenerators = []
# Assumes `stocks` is a list of per-stock DataFrames (unlike the question, where it holds symbols)
for stock in stocks:
    stock_df = stock.copy()
    stock_df.pop('symbol')  # drop the non-numeric symbol column
    target = stock_df.pop('price')

    x = np.array(stock_df.values)
    y = np.array(target.values)

    # sequence = TimeseriesGenerator(x, y, length = 4, sampling_rate = 1, batch_size = 1)
    stock_timegenerators.append(TimeseriesGenerator(x, y, length = 4, sampling_rate = 1, batch_size = 1))

The output is a list of TimeseriesGenerator objects that you can iterate over or reference by index.

[<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c699b0>,
 <tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c6eba8>,
 <tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c782e8>]
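If a single object is needed rather than a list, one option is a thin wrapper that maps a global batch index to the right generator. The sketch below is a hypothetical helper, not part of Keras; it relies only on len(gen) and gen[i], which TimeseriesGenerator supports, and uses plain lists as stand-ins so it runs without TensorFlow.

```python
import bisect

class MergedBatches:
    """Present several indexable batch generators as one indexable sequence."""

    def __init__(self, generators):
        self.generators = list(generators)
        # ends[k] is the total number of batches in generators 0..k
        self.ends = []
        total = 0
        for gen in self.generators:
            total += len(gen)
            self.ends.append(total)

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("batch index out of range")
        # First generator whose cumulative end is greater than the index
        k = bisect.bisect_right(self.ends, index)
        start = self.ends[k - 1] if k > 0 else 0
        return self.generators[k][index - start]

# Stand-ins for per-stock generators: each element plays the role of one batch
merged = MergedBatches([["a0", "a1", "a2"], ["b0", "b1"]])
all_batches = list(merged)
```

Each batch still comes from exactly one underlying generator, so no window mixes rows from two stocks. To pass such an object to model.fit it would probably need to subclass the Keras Sequence base class as well; that part is an assumption and is left out here.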

Also, having multiple Keras TimeseriesGenerator objects means that you're training multiple LSTM models, one for each stock.
You can use the same approach to handle multiple models efficiently.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lstm_models = []
for time_series_gen in stock_timegenerators:

    # lstm_models.append(create_model()) : You could create everything using functions

    # Or in the loop like this.
    # n_input is the window length (4 above); n_features is the number of feature columns
    model = Sequential()
    model.add(LSTM(32, input_shape = (n_input, n_features)))
    model.add(Dense(1))

    model.compile(loss ='mse', optimizer = 'adam')

    model.fit(time_series_gen, steps_per_epoch= 1, epochs = 5)

    lstm_models.append(model)

This outputs a list of models that can easily be referenced by index.

[<tensorflow.python.keras.engine.sequential.Sequential at 0x7eff62c7b748>,
 <tensorflow.python.keras.engine.sequential.Sequential at 0x7eff6100e160>,
 <tensorflow.python.keras.engine.sequential.Sequential at 0x7eff63dc94a8>]

This way you can create multiple LSTM models that use different TimeseriesGenerator objects for different stocks.

Hope this helps you.

Source: https://stackoverflow.com/questions/61155779/merge-or-append-multiple-keras-timeseriesgenerator-objects-into-one

Tags

python

tensorflow

keras

lstm