YOLO object detection: how does the algorithm predict bounding boxes larger than a grid cell?

醉梦人生 2021-02-15 14:54

I am trying to better understand how the YOLO2 & 3 algorithms work. The algorithm applies a series of convolutions until it gets down to a 13x13 grid. Then i

3 answers
  •  南方客 (original poster)
    2021-02-15 15:07

    OK, this is not my first time seeing this question. I had the same problem, and in fact, in all the YOLO 1 & 2 architectures I encountered during my yoloquest, nowhere did the network diagrams imply that classification and localization kicked in at the first layer or the moment the image was fed in. The image passes through a series of convolution layers and filters (I didn't forget the pooling layers, I just feel they are the laziest elements in the network, plus I hate swimming pools, including the words in them).

    • This implies that at the early levels of the network, information is seen or represented differently, i.e. it goes from pixels to outlines, shapes, features, etc. before the object is correctly classified or localised, just as in any normal CNN.

      Since the tensor representing the bounding box predictions and classifications sits towards the end of the network (I see it as regression trained with backpropagation), I believe it is more appropriate to say that the network:

      1. divides the image into cells (actually, the author of the network did this with the training label datasets)
      2. for each cell, tries to predict bounding boxes with confidence scores (I believe the convolutions and filters right after the cell divisions are what let the network correctly predict bounding boxes larger than each cell, because they feed on more than one cell at a time; look at the complete YOLO architecture, there's no incomplete one).
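
To make the "feeds on more than one cell at a time" point concrete, here is a minimal sketch that computes the receptive field of one output cell. The layer stack is a hypothetical toy backbone (not the real Darknet layer list), and the 416x416 input size is an assumption chosen so the total stride of 32 yields a 13x13 grid.

```python
def receptive_field(layers):
    """Return (receptive_field, total_stride) in input pixels.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf, jump = 1, 1  # one input pixel sees itself; neighbours are 1 px apart
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap reaches `jump` px further
        jump *= stride             # striding spreads output units further apart
    return rf, jump


# Hypothetical toy backbone (an assumption, NOT the real Darknet config):
# five stages of one 3x3 conv (stride 1) followed by a 2x2 max-pool (stride 2).
toy_backbone = [(3, 1), (2, 2)] * 5

rf, stride = receptive_field(toy_backbone)
grid = 416 // stride  # assumed 416x416 input

print(f"grid: {grid}x{grid}, cell size: {stride}px, receptive field: {rf}px")
# prints: grid: 13x13, cell size: 32px, receptive field: 94px
```

Even this shallow toy stack lets each of the 13x13 outputs look at a 94px window, roughly three cells wide. A real backbone has many more convolutions per stage, so its receptive field would be larger still.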

      So to conclude, my take is that the network predicts larger bounding boxes for a cell, not that each cell does this on its own. In other words, the network can be viewed as a normal CNN whose output holds, for each cell, the class predictions plus a number of bounding boxes, and whose sole goal is to apply convolutions and feature maps to detect, classify and localise objects in a single forward pass.
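
As a rough illustration of how a per-cell output can still describe a box bigger than the cell, the sketch below decodes one prediction, assuming the anchor-based parametrisation described in the YOLOv2 paper (sigmoid offsets for the centre, exponential scaling of an anchor for the size). The function name, raw values and anchor size are made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Decode raw outputs into (centre_x, centre_y, width, height) in grid-cell units."""
    bx = cell_x + sigmoid(tx)     # sigmoid is in (0, 1): the centre stays inside the cell
    by = cell_y + sigmoid(ty)
    bw = anchor_w * math.exp(tw)  # exp is unbounded: the width can span many cells
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


# Made-up raw outputs for cell (6, 6) with a hypothetical 3x5-cell anchor.
bx, by, bw, bh = decode_box(0.0, 0.0, 0.5, 0.2,
                            cell_x=6, cell_y=6, anchor_w=3.0, anchor_h=5.0)

print(f"centre: ({bx:.2f}, {by:.2f}) cells, size: {bw:.2f} x {bh:.2f} cells")
```

Only the centre is tied to the cell. The width and height are free to exceed one cell, which fits the view that the cell merely indexes where the network writes its prediction.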

    A forward pass implies that neighbouring cells in the division don't query other cells backwards or recursively; larger bounding boxes are predicted by later feature maps and convolutions connected to the receptive areas of previous cell divisions. Also, the box being centroidal is a function of the training data: if the labels were changed to top-leftiness, it wouldn't be centroidal (forgive the grammar).
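
The point about the training labels can be sketched as follows. This is a hypothetical label-assignment helper, not code from any YOLO implementation; the `reference` switch, the 416px image size and the 13x13 grid are assumptions for illustration.

```python
def responsible_cell(x_min, y_min, x_max, y_max,
                     img_size=416, grid=13, reference="center"):
    """Return ((row, col), (width_in_cells, height_in_cells)) for a ground-truth box.

    `reference` picks which point of the box decides the responsible cell.
    """
    cell = img_size / grid  # cell size in pixels
    if reference == "center":
        ref_x, ref_y = (x_min + x_max) / 2, (y_min + y_max) / 2
    elif reference == "top_left":
        ref_x, ref_y = x_min, y_min
    else:
        raise ValueError(f"unknown reference point: {reference!r}")
    # Clamp so a point on the right or bottom image edge still maps to a valid cell.
    col = min(int(ref_x // cell), grid - 1)
    row = min(int(ref_y // cell), grid - 1)
    return (row, col), ((x_max - x_min) / cell, (y_max - y_min) / cell)


box = (58, 108, 358, 308)  # made-up 300x200 px box centred at (208, 208)

print(responsible_cell(*box))                        # ((6, 6), (9.375, 6.25))
print(responsible_cell(*box, reference="top_left"))  # ((3, 1), (9.375, 6.25))
```

The same box covers about 9 x 6 cells either way. Only the choice of which cell carries the label changes, and that choice is made when the training targets are built.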
