Question
I'm using the YOLOv3 model with Keras, and the network gives me outputs with these shapes:
[(1, 13, 13, 255), (1, 26, 26, 255), (1, 52, 52, 255)]
So I found this link.
From it I understand the value 255 in each of the 3 containers, and I also understand that there are 3 containers because 3 different image scales are used for bounding box creation.
But I did not understand why the output has 13 * 13 lists for the first scale, then 26 * 26 lists for the second, then 52 * 52 for the last.
I can't manage to find a good explanation of this, so I can't use this network. If someone knows where I can find information about the output dimensions, I would be very grateful.
EDIT
Is it because, if I cut the image into a 13 by 13 grid, I'm only able to detect 13 * 13 objects, considering that each cell is the center of an object?
Answer 1:
YOLOv3 has 3 output layers. These output layers predict box coordinates at 3 different scales. YOLOv3 also works by dividing the image into a grid of cells; depending on which output layer you look at, the number of cells differs.
So the number of outputs is right: 3 lists (because of the three output layers). You must consider that YOLOv3 is fully convolutional, which means the output layers have shape width x height x filters. Look at the first shape, (1, 13, 13, 255). You already understand that 255 stands for the bounding box coordinates, class probabilities and confidence, and that 1 stands for the batch size. Since the output comes from a conv2d layer, the remaining part to explain is the 13 x 13. It means that your input image is divided into a 13 x 13 grid, and for each cell of the grid the network predicts bounding box coordinates, class probabilities, etc. The second output layer operates at a different scale and divides the image into a 26 x 26 grid; the third one divides it into a 52 x 52 grid, and again bounding box coordinates are predicted for every cell of the grid.
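To make the shapes concrete, here is a small sketch (assuming the usual COCO setup of 80 classes and 3 anchor boxes per scale, so 255 = 3 * (4 + 1 + 80)) that reshapes each output into per-cell, per-anchor predictions. The zero-filled arrays just stand in for whatever your Keras model returns:

```python
import numpy as np

# Stand-ins for the raw model outputs, one per scale (zeros just to show the shapes).
outputs = [np.zeros((1, 13, 13, 255)),
           np.zeros((1, 26, 26, 255)),
           np.zeros((1, 52, 52, 255))]

num_anchors = 3    # anchor boxes predicted per grid cell, per scale
num_classes = 80   # COCO classes; 255 = 3 * (4 + 1 + 80)

for out in outputs:
    batch, grid_h, grid_w, _ = out.shape
    # Split the last axis into (anchors, 4 box coords + 1 objectness + class scores).
    pred = out.reshape(batch, grid_h, grid_w, num_anchors, 5 + num_classes)
    box_xywh   = pred[..., 0:4]   # raw tx, ty, tw, th per cell and anchor
    objectness = pred[..., 4:5]   # confidence that a box is centred in this cell
    class_prob = pred[..., 5:]    # per-class scores
    print(grid_h, grid_w, box_xywh.shape, objectness.shape, class_prob.shape)
```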
Why is this useful? From a practical point of view, imagine a picture with many little pigeons concentrated in one place. When you have only one 13 x 13 output layer, all those pigeons can fall into a single grid cell, so you can't detect them one by one. But if you divide your image into a 52 x 52 grid, the cells are smaller and there is a higher chance of detecting them all. Poor detection of small objects was a common complaint against YOLOv2, so this is the response.
From a more machine-learning point of view, this is an implementation of something called a feature pyramid. The concept was popularized by the RetinaNet architecture.
You process the input image, applying convolutions, max pooling, etc. up to some point, and use that feature map as input to your first output layer (the 13 x 13 one in YOLOv3's case). Then you upscale the feature map that fed the 13 x 13 layer and concatenate it with a feature map of the corresponding size taken from an earlier part of the network. So the input to the next output layer combines upscaled features that were processed all the way through the network with features that were computed earlier, and this leads to better accuracy. For YOLOv3's third output layer you take these concatenated features, upscale them again, concatenate them with earlier features once more, and use the result as input; a sketch of this step follows below.
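As a quick sanity check (assuming the common 416 x 416 input resolution), you can compute how much of the input image a single grid cell covers at each scale:

```python
input_size = 416  # typical YOLOv3 input resolution (an assumption)

for grid in (13, 26, 52):
    stride = input_size / grid
    print(f"{grid}x{grid} grid -> each cell covers {stride:.0f}x{stride:.0f} px of the input")
# 13x13 grid -> each cell covers 32x32 px of the input
# 26x26 grid -> each cell covers 16x16 px of the input
# 52x52 grid -> each cell covers 8x8 px of the input
```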
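A minimal Keras sketch of that upsample-and-concatenate step might look like the following. The channel counts, activations and input shapes are illustrative, not the exact Darknet-53 / YOLOv3 values:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Coarse feature map that feeds the 13x13 output layer (channel count is illustrative).
coarse  = layers.Input(shape=(13, 13, 512))
# Larger feature map taken from an earlier stage of the backbone.
earlier = layers.Input(shape=(26, 26, 256))

x = layers.Conv2D(256, 1, padding="same", activation="relu")(coarse)
x = layers.UpSampling2D(2)(x)                 # 13x13 -> 26x26
x = layers.Concatenate()([x, earlier])        # merge upsampled coarse features with finer ones
x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)  # input for the 26x26 output layer

model = tf.keras.Model([coarse, earlier], x)
model.summary()
```

The same pattern is repeated once more to produce the input for the 52 x 52 output layer.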
Source: https://stackoverflow.com/questions/57112038/yolo-v3-model-output-clarification-with-keras