I am using tf.Tensor and tf.concat() to handle large training data, and I found that calling tf.concat() repeatedly gets slow.
Though there is more than one way of creating a tensor, the answer to the question lies in what is done with the tensors once they are created. Tensors are immutable, therefore each time tf.concat is called, a new tensor is created.
let x = tf.tensor1d([2]);
console.log(tf.memory()) // "numTensors": 1
const y = tf.tensor1d([3])
x = tf.concat([x, y])
console.log(tf.memory()) // "numTensors": 3,
<html>
<head>
<!-- Load TensorFlow.js -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.14.1"> </script>
</head>
<body>
</body>
</html>
As we can see from the snippet above, the number of tensors held in memory after tf.concat is called is 3 and not 2. It is true that tf.tidy will dispose of unused tensors, but this cycle of creating and disposing of tensors becomes more and more costly as the created tensors get bigger and bigger. This is both an issue of memory consumption and of computation, since creating a new tensor always delegates to a backend.
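For completeness, here is a small sketch (based on the same snippet as above) of disposing the intermediate tensors manually, which keeps numTensors from growing:

let x = tf.tensor1d([2]);
const y = tf.tensor1d([3]);
const merged = tf.concat([x, y]);
x.dispose(); // explicitly release the tensors that are no longer needed
y.dispose();
x = merged;
console.log(tf.memory()); // "numTensors": 1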
Now that the issue of performance is understood, what is the best way to proceed?
const x = [];
for (let i = 0; i < data.length; i++) {
  // fill array x
  x.push(dataValue)
}
// create the tensor once, from the complete array
tf.tensor(x)
Though it is the trivial solution, it is not always possible, because creating an array keeps all the data in memory and we can easily run out of memory with big data entries. Therefore it is sometimes best, instead of creating the whole JavaScript array, to create chunks of arrays, create a tensor from each chunk, and start processing those tensors as soon as they are created. The chunk tensors can be merged using tf.concat again if necessary, but that might not always be required.
For instance we can call model.fit() repeatedly using chunks of tensors instead of calling it once with a big tensor that might take long to create. In this case, there is no need to concatenate the chunk tensors.
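A rough sketch of that idea, assuming an already compiled model and a hypothetical loadChunk helper that returns plain JavaScript arrays for one chunk (neither is part of the original answer):

async function trainInChunks(model, numChunks) {
  for (let i = 0; i < numChunks; i++) {
    const chunk = loadChunk(i);           // hypothetical helper returning {xs, ys} arrays
    const xs = tf.tensor(chunk.xs);
    const ys = tf.tensor(chunk.ys);
    await model.fit(xs, ys, {epochs: 1}); // train on this chunk only
    xs.dispose();                         // free the chunk tensors before loading the next one
    ys.dispose();
  }
}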
function makeIterator() {
  let index = 0;               // current position in the data
  const iterator = {
    next: () => {
      let result;
      if (index < data.length) {
        result = {value: dataValue, done: false};
        index++;
        return result;
      }
      return {value: dataValue, done: true};
    }
  };
  return iterator;
}
const ds = tf.data.generator(makeIterator);
The advantage of using tf.data is that the dataset is created in batches, only as they are needed, during the model.fitDataset call.
While the tf.concat and Array.push functions look and behave similarly, there is one big difference:

- tf.concat creates a new tensor from the input
- Array.push adds the input to the first array

tf.concat
const a = tf.tensor1d([1, 2]);
const b = tf.tensor1d([3]);
const c = tf.concat([a, b]);
a.print(); // Result: Tensor [1, 2]
b.print(); // Result: Tensor [3]
c.print(); // Result: Tensor [1, 2, 3]
The resulting variable c is a new Tensor, while a and b are not changed.
Array.push
const a = [1,2];
a.push(3);
console.log(a); // Result: [1,2,3]
Here, the variable a is directly changed.
For the runtime speed, this means that tf.concat copies all tensor values into a new tensor before adding the input. This obviously takes more time the bigger the array that needs to be copied is. In contrast, Array.push does not create a copy of the array, and therefore the runtime will be more or less the same no matter how big the array is.
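A quick way to see the difference is a micro-benchmark; this is only a sketch, and the timing values will vary by machine:

// growing a tensor with repeated tf.concat: every step copies all previous values
let t = tf.tensor1d([0]);
console.time('repeated tf.concat');
for (let i = 1; i < 1000; i++) {
  const step = tf.tensor1d([i]);
  const next = tf.concat([t, step]);
  t.dispose();    // free the old tensors so only the result stays in memory
  step.dispose();
  t = next;
}
console.timeEnd('repeated tf.concat');

// growing a plain array with push: no copying of existing elements
const arr = [0];
console.time('Array.push');
for (let i = 1; i < 1000; i++) {
  arr.push(i);
}
console.timeEnd('Array.push');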
Note that this is "by design", as tensors are immutable, so every operation on an existing tensor always creates a new tensor. Quote from the docs:
Tensors are immutable, so all operations always return new Tensors and never modify input Tensors.
Therefore, if you need to create a large tensor from input data it is advisable to first read all data from your file and merge it with "vanilla" JavaScript functions before creating a tensor from it.
In case you have a dataset so big that you need to handle it in chunks because of memory restrictions, you have two options: use trainOnBatch, or use a tf.data dataset.
The trainOnBatch function allows you to train on a batch of data instead of using the full dataset. Therefore, you can split your data into reasonable batches before training on them, so you don't have to merge all of your data together at once.
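A minimal sketch of this approach; the loadBatch helper and the logging are assumptions, only model.trainOnBatch comes from the answer:

async function trainOnBatches(model, numBatches) {
  for (let i = 0; i < numBatches; i++) {
    const {xs, ys} = loadBatch(i);                 // hypothetical helper returning tensors for one batch
    const loss = await model.trainOnBatch(xs, ys); // train on this batch only
    console.log(`batch ${i}: loss = ${loss}`);
    xs.dispose();                                  // release the batch tensors before the next one
    ys.dispose();
  }
}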
The other option is a tf.data dataset; the other answer already went over the basics. This approach allows you to use a JavaScript generator function to prepare the data. I recommend using the generator syntax instead of an iterator factory (used in the other answer), as it is the more modern JavaScript syntax.
Example (taken from the docs):
function* dataGenerator() {
  const numElements = 10;
  let index = 0;
  while (index < numElements) {
    const x = index;
    index++;
    yield x;
  }
}
const ds = tf.data.generator(dataGenerator);
You can then use the fitDataset function to train your model.
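A small end-to-end sketch of that; the model architecture, shapes, and hyperparameters here are assumptions for illustration only:

// a toy model with a single numeric input and output (assumed, not from the answer)
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

function* dataGenerator() {
  for (let i = 0; i < 100; i++) {
    // yield one sample at a time: a feature array and a label array
    yield {xs: [i], ys: [2 * i]};
  }
}

// group the samples into batches before passing the dataset to fitDataset
const ds = tf.data.generator(dataGenerator).batch(10);
model.fitDataset(ds, {epochs: 3}).then(() => console.log('training done'));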