As far as I know, CNNs rely on sliding-window techniques and can only indicate whether a certain pattern is present somewhere within a given bounding box. Is that true?
Can one achieve localization with a CNN without the help of such techniques?
That's an open problem in image recognition. Besides sliding windows, existing approaches include predicting the object's location in the image directly as a CNN output, predicting borders (classifying pixels as belonging to a boundary or not), and so on. See, for example, this paper and the references therein.
Also note that in a CNN with max-pooling, one can identify the positions of the feature detectors that contributed to recognizing the object, and use them to suggest a likely object location region.
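To make the max-pooling remark concrete, here is a minimal sketch in plain Python. It is only an illustration: the function name `max_pool_with_argmax` and the toy activation values are made up, and a real network would do this per channel and per layer. The idea is simply that each pooled value can carry the coordinates of the input position it came from, so a strong response can be traced back to a place in the feature map.

```python
def max_pool_with_argmax(feature_map, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled, positions = [], []
    for r in range(0, rows - size + 1, size):
        pooled_row, pos_row = [], []
        for c in range(0, cols - size + 1, size):
            # Scan the window and remember the coordinates of its largest value.
            best_val, best_pos = feature_map[r][c], (r, c)
            for dr in range(size):
                for dc in range(size):
                    value = feature_map[r + dr][c + dc]
                    if value > best_val:
                        best_val, best_pos = value, (r + dr, c + dc)
            pooled_row.append(best_val)
            pos_row.append(best_pos)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


# Toy 4x4 activation map of a single (hypothetical) feature detector.
activations = [
    [0.1, 0.2, 0.0, 0.1],
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.1, 0.9, 0.4],
    [0.0, 0.1, 0.5, 0.3],
]

pooled, positions = max_pool_with_argmax(activations)

# Pair every pooled value with its source position and pick the strongest one.
candidates = [
    (pooled[i][j], positions[i][j])
    for i in range(len(pooled))
    for j in range(len(pooled[0]))
]
strongest_value, strongest_position = max(candidates)
print(strongest_value, strongest_position)
```

The recorded position is in feature-map coordinates; mapping it back to image pixels would additionally require accounting for the strides and receptive fields of the earlier layers, which this sketch leaves out.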
There are some recent techniques for localizing objects with CNNs. See this paper: http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf
It uses a layer called Global Average Pooling (GAP). With it, a CNN trained only for classification can also localize the object it recognizes, without any extra localization training.
Also check out this really good blog post: https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/
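As a rough sketch of how the GAP idea yields a location, the class activation map from that paper weights each of the last convolutional feature maps by the class's weight in the layer that follows GAP, and sums them per spatial position. The plain-Python example below uses invented numbers (two 3x3 feature maps and the weights `[1.5, -0.5]` are assumptions for illustration, not values from any trained model) and ignores the bias term.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def global_average_pool(fmap):
    """Mean of all spatial positions in a single 2D map."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))


# Two toy 3x3 feature maps from the last conv layer (made-up values).
feature_maps = [
    [[0.0, 0.0, 0.0], [0.0, 2.0, 1.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
# Hypothetical weights connecting each pooled feature to one class.
class_weights = [1.5, -0.5]

cam = class_activation_map(feature_maps, class_weights)

# The class score the network computes: GAP first, then the weighted sum.
class_score = sum(
    weight * global_average_pool(fmap)
    for fmap, weight in zip(feature_maps, class_weights)
)

# The peak of the map is the position that contributes most to the class score.
peak_value, peak_position = max(
    (cam[y][x], (y, x)) for y in range(len(cam)) for x in range(len(cam[0]))
)
print(peak_position, class_score)
```

Because averaging and the weighted sum are both linear, the class score equals the spatial average of the map, which is why the map can be read as a per-position breakdown of the classification decision. In practice the map is upsampled to the input image size before being overlaid on the image.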
Source: https://stackoverflow.com/questions/28178054/does-convolutional-neural-network-possess-localization-abilities-on-images