I'm training the Keras object detection model linked at the bottom of this question, although I believe my problem has to do neither with Keras nor with the specific model I'm training.
I've figured it out myself:
TL;DR:
Make sure your loss magnitude is independent of your mini-batch size.
The long explanation:
In my case the issue was Keras-specific after all.
Maybe the solution to this problem will be useful for someone at some point.
It turns out that Keras divides the loss by the mini-batch size. The important thing to understand here is that it's not the loss function itself that averages over the batch size, but rather that the averaging happens somewhere else in the training process.
Why does this matter?
The model I am training, SSD, uses a rather complicated multi-task loss function that does its own averaging (not by the batch size, but by the number of ground truth bounding boxes in the batch). Now, if the loss function already divides the loss by some number that is correlated with the batch size, and Keras afterwards divides by the batch size on top of that, then all of a sudden the magnitude of the loss value starts to depend on the batch size (to be precise, it becomes inversely proportional to it).
Now, usually the number of samples in your dataset is not an integer multiple of the batch size you choose, so the very last mini-batch of an epoch (here I implicitly define an epoch as one full pass over the dataset) ends up containing fewer samples than the batch size. If the loss magnitude depends on the batch size, this messes up the magnitude of the loss on that last step, and in turn the magnitude of the gradient. Since I'm using an optimizer with momentum, that messed-up gradient continues to influence the gradients of a few subsequent training steps, too.
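To make this concrete, here is a toy calculation (the numbers are made up) showing how much larger the effective loss, and hence the gradient, becomes on that final, smaller mini-batch when the loss magnitude is inversely proportional to the batch size:

# Toy illustration with made-up numbers: assume every mini-batch carries the
# same already-normalized total loss before the extra division by the batch size.
normalized_loss = 10.0
full_batch_size = 32   # regular mini-batch size
last_batch_size = 7    # leftover samples in the final mini-batch of an epoch

print(normalized_loss / full_batch_size)  # ~0.31 -> what the optimizer sees on most steps
print(normalized_loss / last_batch_size)  # ~1.43 -> roughly 4.5x larger on the last step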
Once I adjusted the loss function to multiply the loss by the batch size (thus canceling out Keras' subsequent division by the batch size), everything was fine: no more spikes in the loss.
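For reference, here is a minimal sketch of the kind of adjustment I mean (the names scale_by_batch_size and ssd_multibox_loss are placeholders, not the actual SSD implementation): wrap your loss function so that it pre-multiplies by the batch size, canceling out Keras' subsequent division.

def scale_by_batch_size(loss_fn, batch_size):
    # `loss_fn` is assumed to do its own normalization (e.g. by the number of
    # ground truth boxes). Pre-multiplying by the batch size cancels out the
    # division by the batch size that Keras applies afterwards.
    def wrapped(y_true, y_pred):
        return batch_size * loss_fn(y_true, y_pred)
    return wrapped

# Usage sketch (placeholder names for your own model and loss):
# model.compile(optimizer='adam',
#               loss=scale_by_batch_size(ssd_multibox_loss, batch_size=32))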
I would add gradient clipping, because it prevents spikes in the gradients from messing up the parameters during training.
Gradient clipping is a technique to prevent exploding gradients in very deep networks, typically recurrent neural networks.
Most frameworks allow you to add a gradient clipping parameter to your gradient-descent-based optimizer.
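In Keras, for instance, this is just an extra argument on the optimizer: clipnorm clips each gradient by its L2 norm, clipvalue clips element-wise. A minimal sketch (the threshold of 1.0 is only an example value, and my_loss is a placeholder for your own loss):

from tensorflow import keras

# clipnorm clips each gradient tensor by its L2 norm; clipvalue would clip element-wise.
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, clipnorm=1.0)

# model.compile(optimizer=optimizer, loss=my_loss)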
For anyone working in PyTorch, an easy solution to this specific problem is to tell the DataLoader to drop the last batch:
import torch

# drop_last=True discards the final, incomplete mini-batch, so every batch
# the model sees has exactly `batch_size` samples.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=False,
                                           pin_memory=torch.cuda.is_available(),
                                           num_workers=num_workers, drop_last=True)
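If you are on the Keras/TensorFlow side and feed the model through tf.data, the analogous knob is drop_remainder=True when batching; a minimal sketch, assuming your data is already available as tensors (train_examples is a placeholder):

import tensorflow as tf

# `train_examples` is a placeholder for your own (inputs, targets) tensors.
train_ds = tf.data.Dataset.from_tensor_slices(train_examples)
# drop_remainder=True discards the incomplete final batch, analogous to
# PyTorch's drop_last=True.
train_ds = train_ds.shuffle(buffer_size=1024).batch(32, drop_remainder=True)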