What does fast loss convergence indicate on a CNN?


Question


I'm training two CNNs (AlexNet and GoogLeNet) in two different DL libraries (Caffe and TensorFlow). The networks were implemented by the dev teams of each library (here and here).

I reduced the original ImageNet dataset to 1024 images of a single category -- but kept the networks configured to classify 1000 categories. A minimal sketch of that setup is shown below.
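To make the setup concrete, here is a hypothetical sketch of what such a reduced dataset could look like; the random placeholder images, the class index, and the input size are illustrative assumptions, not the actual data pipeline used.

```python
# Hypothetical illustration of the setup: 1024 images that all carry the same
# label, while the classifier head still has 1000 outputs. The random arrays
# below are placeholders, not real ImageNet data.
import numpy as np

NUM_CLASSES = 1000       # the networks still classify over 1000 categories
NUM_IMAGES = 1024        # the reduced dataset size
SINGLE_CLASS_ID = 0      # every image belongs to this one (arbitrary) category

# 224x224 RGB placeholders (roughly the input size these networks expect)
images = np.random.rand(NUM_IMAGES, 224, 224, 3).astype(np.float32)
labels = np.full(NUM_IMAGES, SINGLE_CLASS_ID, dtype=np.int64)  # identical labels everywhere
```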

So I trained the CNNs, varying the processing unit (CPU/GPU) and the batch sizes, and I observed that the losses converge quickly to near zero (in most cases before 1 epoch is completed), as in this graph (AlexNet on TensorFlow):

In Portuguese, 'Épocas' means epochs and 'Perda' means loss. The numbers in the legend refer to batch sizes.

The weight decay and initial learning rate are the same as in the models I downloaded; I only changed the dataset and the batch sizes.

Why are my networks converging this way, and not like this?


Answer 1:


Your loss curve is less noisy than usual and does not oscillate, for a few reasons.

The main one is that you have only 1 category, so (to simplify a bit) the network easily improves at every step, simply by raising the score of that category on all your inputs.

Take a look at the (beautiful!) image below: if you have several classes, a good step for one sample is often a bad one for another sample (because they have different categories), which is why the loss sometimes goes up locally. A network update made on a sample of category 1 is a bad step for all samples of category 2, and conversely, but the sum of the two types of updates goes in the right direction (they compensate for each other's bad parts, and only the useful part of the steps remains). If you have 1 class, you'll go straight and fast to "always predict category 1", whereas with 2 or more categories, you'll zigzag and converge slowly to "always predict correctly".
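To make this concrete, here is a small self-contained toy experiment (a linear softmax classifier trained with per-sample SGD, not the asker's CNNs): when only one class is present in the data, every update pushes the same logit up and the loss falls smoothly, whereas with several classes present the per-sample updates partly conflict and the loss zigzags. All the numbers below (10 outputs, 4 features, the learning rate, etc.) are arbitrary illustrative choices.

```python
# Toy demonstration: per-sample SGD on a linear softmax head. Only some of the
# output classes actually occur in the data, mimicking "1000 outputs but only
# 1 category present" on a tiny scale.
import numpy as np

rng = np.random.default_rng(0)
NUM_OUTPUTS = 10  # stand-in for the 1000-way classifier head

def sgd_loss_curve(classes_present, n=32, steps=300, lr=0.5, dim=4):
    """Return the per-step cross-entropy loss of batch-size-1 SGD."""
    labels = rng.integers(0, classes_present, size=n)
    means = rng.normal(scale=2.0, size=(classes_present, dim))      # one cluster centre per class
    x = means[labels] + rng.normal(size=(n, dim))                   # learnable but overlapping data
    x = np.hstack([x, np.ones((n, 1))])                             # constant column acts as a bias
    w = np.zeros((dim + 1, NUM_OUTPUTS))
    losses = []
    for t in range(steps):
        i = t % n                                                   # batch size 1: one sample per step
        logits = x[i] @ w
        p = np.exp(logits - logits.max())
        p /= p.sum()
        losses.append(-np.log(p[labels[i]]))                        # cross-entropy on that sample
        grad = p.copy()
        grad[labels[i]] -= 1.0                                      # gradient w.r.t. the logits
        w -= lr * np.outer(x[i], grad)                              # SGD update
    return losses

one = sgd_loss_curve(classes_present=1)
five = sgd_loss_curve(classes_present=5)
print("final loss, 1 class present :", round(one[-1], 4))
print("final loss, 5 classes present:", round(five[-1], 4))
print("steps where the loss went up, 1 class :", sum(b > a for a, b in zip(one, one[1:])))
print("steps where the loss went up, 5 classes:", sum(b > a for a, b in zip(five, five[1:])))
```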

There are a few other effects, like the fact that your dataset is relatively small (so it's easier to learn), that you don't evaluate that often, and maybe some smoothing (is your loss computed on the whole dataset or on a batch? Usually it's on a batch, which contributes to the noisy look of the usual loss curve).

The difference between your curves is also normal, but still characteristic of having only 1 class actually present in the dataset.

First, notice that the CPU and the GPU show the same behaviour, because they do exactly the same thing, just at different speeds.

When your batch size is > 1, each network update is the average of all the updates you would have made with the samples taken individually (again simplifying a bit). So usually you get smarter updates (more likely to go in the direction of "always predict correctly"), and you need fewer updates to reach good performance. There is a trade-off between this faster convergence and the fact that bigger batches use more data for each update, so it's hard to say beforehand which curve should converge faster. It's widely considered that you should use minibatches of size > 1 (but not too big either). Now, when only 1 class is actually present in the dataset, all updates point roughly in the same direction, "always predict 1", so the minibatch average is basically the same as an individual update, but consumes more data to produce it. Since you still need the same number of these updates, you'll converge after the same number of steps, and therefore consume more data for the same result.
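A similarly hypothetical sketch of the batch-size argument, using the same kind of toy linear softmax model: the minibatch update is just the average of the per-sample gradients, so when all samples share one class, a larger batch mostly reproduces the same step per update while consuming more images. The model size, learning rate, and step count are arbitrary assumptions for illustration.

```python
# Toy minibatch SGD on a linear softmax head, with only one class present in
# the data. The weight update is the average of the per-sample gradients, so
# larger batches give roughly the same step per update but consume
# `batch_size` times more images.
import numpy as np

rng = np.random.default_rng(0)
NUM_OUTPUTS = 10  # stand-in for the 1000-way classifier head

def minibatch_loss_curve(batch_size, n=1024, steps=300, lr=0.5, dim=4):
    labels = np.zeros(n, dtype=int)                                  # only class 0 is present
    x = np.hstack([rng.normal(size=(n, dim)), np.ones((n, 1))])      # constant column acts as a bias
    w = np.zeros((dim + 1, NUM_OUTPUTS))
    losses = []
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)                    # draw one minibatch
        logits = x[idx] @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        losses.append(-np.log(p[np.arange(batch_size), labels[idx]]).mean())
        grad = p.copy()
        grad[np.arange(batch_size), labels[idx]] -= 1.0              # gradient w.r.t. the logits
        w -= lr * (x[idx].T @ grad) / batch_size                     # average of per-sample updates
    return losses

for b in (1, 8, 64):
    print(f"batch size {b:>2}: loss after 300 updates = {minibatch_loss_curve(b)[-1]:.4f}, "
          f"images consumed = {300 * b}")
```

The printed lines show the point of the answer: after the same number of updates the losses end up in roughly the same place, but the larger batches have consumed many more images to get there.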



Source: https://stackoverflow.com/questions/47649786/what-fast-loss-convergence-indicates-on-a-cnn
