For making the training of a model faster, it seems to be a good practice to populate/generate batches on CPU and run the training of the model on GPU in parallel. For this purp
I have a proposed "improved" solution that may interest others. Please note this is coming from my experience with Tensorflow 1.15 (I have yet to use version 2).
Install wsl
version 2 on Windows, install Tensorflow in a Linux environment (e.g. Ubuntu) here, and then set use_multiprocessing
to True
to get this to work.
NOTE: The Windows Subshell for Linux (WSL) version 2 is only available in Windows 10, Version 1903, Build 18362 or higher. Be sure to upgrade your Windows version in Windows Update to get this to work.
See Install Tensorflow-GPU on WSL2
For multitasking
and multithreading
(i.e. parallelism
and concurrency
), there are two operations we must consider:
forking
= a parent process creates a copy of itself (a child) that has an exact copy of all the memory segments it usesspawning
= a parent process creates an entirely new child process that does not share its memory and the parent process must wait for the child process to finish before continuingLinux supports forking
, but Windows does not. Windows only supports spawning
.
The reason Windows hangs when using use_multiprocessing=True
is because the Python threading
module uses spawn
for Windows. Hence, the parent process waits forever for the child to finish because the parent cannot transfer its memory to the child, so the child doesn't know what to do.
Answer 2: It is not threadsafe
. On Windows, if you've ever attempted to use a data generator or sequence, you've probably seen an error like this
ValueError: Using a generator with use_multiprocessing=True is not supported on Windows
(no marshalling of generators across process boundaries). Instead, use single
thread/process or multithreading.
marshalling
means "transforming the memory representation of an object into a data format that is suitable for transmission." The error is saying that unlike Linux, which uses fork
, use_multiprocessing=True
doesn't work on Windows because it uses spawn` and cannot transfer its data to the child thread.
At this point, you may be asking yourself:
"Wait...What about the Python Global Interpreter Lock (GIL)?..If Python only allows one thread to run at a time, why does it even have the threading
module and why do we care about this in Tensorflow??!"
The answer lies in the difference between CPU-bound tasks
and I/O-bound tasks
:
CPU-bound tasks
= those that are waiting for data to be crunchedI/O-bound tasks
= those that are waiting for input or output from other processes (i.e. data transferring)In programming, when we say two tasks are concurrent
, we mean they can start, run, and complete in overlapping time. When we say they are parallel
, we mean they are literally running at the same time.
So, the GIL prevents threads from running in parallel, but not concurrently. The reason this is important for Tensorflow is because concurrency is all about I/O operations (data transfer). A good dataflow pipeline in Tensorflow should try to be concurrent
so that there's no lag time when data is being transferred to-and-from the CPU, GPU, and/or RAM and training finishes faster. (Rather than have a thread sit and wait until it gets data back from somewhere else, we can have it executing image preprocessing or something else until the data gets back.)
GIL
was made in Python because everything in Python is an object. (This is why you can do "weird" things with "dunder/magic" methods, like (5).__add__(3)
to get 8
NOTE: In the above, parentheses are needed around 5
since 5.
is a float
, so we need to take advantage of order of operations by using parentheses.
Python handles memory and garbage collection by counting all references made to individual objects. When the count goes to 0, Python deletes the object. If two threads tried to access the same object simultaneously, or if one thread finishes faster than another, you can get a race condition
and objects would be deleted "randomly". We could put a lock
on each thread, but then we would be unable to prevent deadlocks
.
Losing parallel
thread execution was seen by Guido (and by myself, though it is certainly arguable) as a minor loss because we still maintained I/O concurrent operations, and tasks could still be run in parallel
by running them on different cpu cores (i.e. multiprocessing
). Hence, this is (one reason) why Python has both the threading
and multiprocessing
modules.
Now back to threadsafe
. When running concurrent
/parallel
tasks, you have to watch out for additional things. Two big ones are:
race conditions
- operations don't take exactly the same time to compute each time a program as run (why with timeit
we average over a number of runs). Because threads will finish at different times depending on the run, you get different results with each run.
deadlock
- if two threads try to access the same memory at the same time, you'll get an error. To prevent this, we add a lock
or mutex
(mutual exclusion) to threads to prevent other threads from accessing the same memory while it is running. However, if two threads need to access the same memory, are locked, and each thread depends on the other one finishing in order to execute, the program hangs.
I bring this up because Tensorflow needs to be able to pickle
Python objects to make code run faster. (pickling
is turning objects and data into byte code, much in the same way that an entire program's source code is converted into an exe
on Windows). The Tensorflow Iterator.__init__()
method locks threads and contains a threading.Lock()
def __init__(self, n, batch_size, shuffle, seed):
...
self.lock = threading.Lock()
...
The problem is Python cannot pickle
threading lock objects on Windows (i.e. Windows cannot marshall
thread locks to child
threads).
If you were to try to use a generator and pass it to fit_generator
, you will get the error (see GitHub Issue #10842
TypeError: can't pickle _thread.lock objects
So, while use_multiprocessing=True
is threadsafe on Linux, it is not on Windows.
Solution: Around June 2020, Microsoft came out with version 2 of the Windows Subshell for Linux (wsl
). This was significant because it enabled GPU hardware acceleration. Version 1 was "simply" a driver between Windows NT and Linux, whereas wsl
is now actually a kernel. Thus, you can now install Linux on Windows, open a bash shell from the command prompt, and (most importantly) access hardware. Thus, it is now possible to install tensorflow-gpu
on wsl
. In addition, you'll now be able to use fork
.
**Thus, I recommend
wsl
version 2 on Windows and add your desired Linux environmenttensorflow-gpu
in a virtual environment in wsl
Linux environment hereuse_multiprocessing=True
to see if it works.**CAVEAT: I haven't tested this yet to verify that it works, but to the best of my limited knowledge, I believe it should.
After this, answering Question 3 should be a simple matter of tuning the amount of concurrency with the amount of parallelism, and I recommend the TensorflowDev 2018 Summit video Training Performance: A user’s guide to converge faster to see how to do that.
Among those who have seen this post, no one seems to have the ultimate answer so that I wanted to give my answer that worked out for me. Because of lack of documentation in the domain, my answer might be missing some relevant details. Please feel free to add more information that I do not mention down here.
Seemingly, writing a generator class in Python that inherits the Sequence
class is just not supported in Windows. (You can seemingly make it work on Linux.) To be able to make it work, you need to set the parameter use_multiprocessing=True
(with the class approach). But it is not working on Windows as mentioned so that you have to set use_multiprocessing
to False
(on Windows).
Nevertheless, that does not mean that multiprocessing does not work on Windows. Even if you set use_multiprocessing=False
, multiprocessing can still be supported when the code is run with the following setup where you just set the workers
parameter to any value that is bigger than 1.
Example:
history = \
merged_model.fit_generator(generator=train_generator,
steps_per_epoch=trainset_steps_per_epoch,
epochs=300,
verbose=1,
use_multiprocessing=False,
workers=3,
max_queue_size=4)
At this point, let's remember the Keras documentation again:
The use of keras.utils.Sequence guarantees the ordering and guarantees the single use of every input per epoch when using use_multiprocessing=True.
To my understanding, if use_multiprocessing=False
, then the generator is not thread safe anymore, which makes it difficult to write a generator class that inherits Sequence
.
To come around this problem, I have written a generator myself which I have made thread safe manually. Here is an example pseudocode:
import tensorflow as tf
import threading
class threadsafe_iter:
"""Takes an iterator/generator and makes it thread-safe by
serializing call to the `next` method of given iterator/generator.
"""
def __init__(self, it):
self.it = it
self.lock = threading.Lock()
def __iter__(self):
return self
def __next__(self): # Py3
return next(self.it)
#def next(self): # Python2 only
# with self.lock:
# return self.it.next()
def threadsafe_generator(f):
"""A decorator that takes a generator function and makes it thread-safe.
"""
def g(*a, **kw):
return threadsafe_iter(f(*a, **kw))
return g
@threadsafe_generator
def generate_data(tfrecord_file_path_list, ...):
dataset = tf.data.TFRecordDataset(tfrecord_file_path_list)
# example proto decode
def _parse_function(example_proto):
...
return batch_data
# Parse the record into tensors.
dataset = dataset.map(_parse_function)
dataset = dataset.shuffle(buffer_size=100000)
# Repeat the input indefinitly
dataset = dataset.repeat()
# Generate batches
dataset = dataset.batch(batch_size)
# Create an initializable iterator
iterator = dataset.make_initializable_iterator()
# Get batch data
batch_data = iterator.get_next()
iterator_init_op = iterator.make_initializer(dataset)
with tf.Session() as sess:
sess.run(iterator_init_op)
while True:
try:
batch_data = sess.run(batch_data)
except tf.errors.OutOfRangeError:
break
yield batch_data
Well, it can be discussed if it is really elegant to do it in this way but it seems to be working pretty well.
To summarize:
use_multiprocessing
to False
.Sequence
when writing code on Windows. (It is a Tensorflow/Keras problem I guess).workers
to a number that is greater than 1.Important note: In this setup, the generator is being run on CPU and the training is being done on GPU. One problem I could observe is that if the model you are training is shallow enough, the utilization of GPU remains very low while CPU utilization gets high. If the model is shallow and the dataset is small enough, it can be a good option to store all the data in the memory and run everything on GPU. It should speed up the training significantly. If, for any reason, you would like to use CPU and GPU simultaneously, my modest recommendation is to try to use Tensorflow's tf.data API which significantly speeds up the data preprocessing and batch preparation. If the generator is only written in Python, GPU keeps waiting for data to continue with the training. One can say everything about the Tensorflow/Keras documentation, but it is really efficient code!
Anyone having more complete knowledge on the API and seeing this post, please feel free to correct me here in case I misunerstand anything or the API is updated to solve the problems even on Windows.