What does the coordinate output of the YOLO algorithm represent?

Question

My question is similar to this topic. I was watching this lecture on bounding box prediction by Andrew Ng when I started thinking about the output of the YOLO algorithm. Consider this example: we use a 19x19 grid and only one receptive field with 2 classes, so the output has shape 19x19x1x5. The last dimension (an array of size 5) represents the following:

1) The class (0 or 1)  
2) X-coordinate  
3) Y-coordinate  
4) Height of the bounding box  
5) Width of the bounding box

I don't understand whether the X and Y coordinates describe the bounding box with respect to the entire image or just the receptive field (filter). In the video, the bounding box is shown as part of the receptive field, but logically the receptive field is much smaller than the bounding box, and people might also tinker with the filter size, so positioning bounding boxes with respect to the filter makes no sense.

So, basically, what do the bounding box coordinates of an image represent?

Answer 1:

From the "Understanding YOLO" post on Hacker Noon:

Each grid cell predicts B bounding boxes as well as C class probabilities. The bounding box prediction has 5 components: (x, y, w, h, confidence). The (x, y) coordinates represent the center of the box, relative to the grid cell location (remember that, if the center of the box does not fall inside the grid cell, then this cell is not responsible for it). These coordinates are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size. Let’s look at an example:
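To make the convention quoted above concrete, here is a minimal Python sketch that converts one prediction into pixel corner coordinates. It assumes the layout described in the quote: (x, y) is the box center as an offset inside its grid cell, and (w, h) are fractions of the whole image. The function name `decode_box`, its parameters, and the 608x608 image size are hypothetical choices for illustration, not taken from any YOLO implementation. Some variants (including the one in the lecture, as I understand it) express width and height differently, so treat this as one possible reading.

```python
def decode_box(x, y, w, h, row, col, grid_size, img_w, img_h):
    """Convert one YOLO-style prediction to pixel corners (x1, y1, x2, y2).

    Assumed convention (from the quoted description):
      x, y -- box center, offset in [0, 1] inside grid cell (row, col)
      w, h -- box size, as a fraction of the full image width/height
    """
    cell_w = img_w / grid_size  # width of one grid cell in pixels
    cell_h = img_h / grid_size  # height of one grid cell in pixels

    # Center in pixels: cell origin plus the offset inside the cell.
    cx = (col + x) * cell_w
    cy = (row + y) * cell_h

    # Size in pixels: scaled by the whole image, not by the cell.
    bw = w * img_w
    bh = h * img_h

    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# Hypothetical example: 19x19 grid on a 608x608 image, so each cell is 32 px.
# The box is centered in cell (row=9, col=9) and spans 25% of the image
# width and 50% of its height.
box = decode_box(0.5, 0.5, 0.25, 0.5, row=9, col=9,
                 grid_size=19, img_w=608, img_h=608)
print(box)  # (228.0, 152.0, 380.0, 456.0)
```

Note that the box (152 px wide, 304 px tall) is much larger than the 32 px cell that predicts it. The cell only anchors the center; the size is free to cover any part of the image, which is why the box is not limited by the cell or by the filter size.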

Source: https://stackoverflow.com/questions/52455429/what-does-the-coordinate-output-of-yolo-algorithm-represent

Tags

machine-learning

deep-learning

computer-vision

conv-neural-network

yolo