I have a 416x416 input image. How can I create an output of 4 x 10, where 4 is the number of columns and 10 is the number of rows?
My label data is a 2D array with 4 columns and
I believe the easiest way to make your predictions match the desired output shape is the solution proposed by @Darlyn. Assume the network you have so far, which outputs tensors of shape (13, 13, 1024), was declared like this:
x = Input(shape=(416, 416, 3))
y = Conv2D(32, activation='relu')(x)
...
y = Conv2D(1024, activation='relu')(y)
You just need to add a regression layer that will try to predict the boxes, and then reshape its output to (10, 4):
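import numpy as np
from keras.layers import Flatten, Dense, Reshape
from keras.models import Model

samples = 1
boxes = 10

# `model` is assumed to be the network declared above, e.g. Model(inputs=x, outputs=y)
y = Flatten(name='flatten')(model.output)
# one regression unit per box coordinate
y = Dense(boxes * 4, activation='relu')(y)
# (boxes * 4,) -> (boxes, 4), matching the 10-row, 4-column label layout
y = Reshape((boxes, 4), name='predictions')(y)
model = Model(inputs=model.inputs, outputs=y)

x_train = np.random.randn(samples, 416, 416, 3)
p = model.predict(x_train)
print(p.shape)
(1, 10, 4)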
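To get a feel for what this regression head costs, here is a small sketch of the shape arithmetic in plain Python. It only uses the shapes quoted in this answer, (13, 13, 1024) features and 10 boxes of 4 values; the variable names are mine and nothing here calls Keras.

```python
# Shapes taken from the answer; the rest is plain arithmetic.
feature_shape = (13, 13, 1024)  # backbone output per image
boxes = 10                      # boxes predicted per image
coords = 4                      # values per box

# Flatten turns the feature map into one long vector.
flat_units = 1
for dim in feature_shape:
    flat_units *= dim

# Dense(boxes * 4) has one weight per (input, output) pair plus one bias per output.
dense_units = boxes * coords
dense_params = flat_units * dense_units + dense_units

print(flat_units)    # 173056
print(dense_units)   # 40
print(dense_params)  # 6922280
```

So the single Dense layer alone adds roughly 6.9 million trainable parameters on top of the convolutional part.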
This works, but I'm not entirely sure that directly regressing these values will produce good results. I usually see object-detection models using attention, region proposals or saliency to determine the position of objects. There are a couple of Keras object-detection implementations you could try, such as keras-rcnn and keras-retinanet:
classes = ["dog", "cat", "hooman"]
backbone = keras_rcnn.models.backbone.VGG16
model = keras_rcnn.models.RCNN((416, 416, 3), classes, backbone)
boxes, predictions = model.predict(x)
from keras_retinanet.models.resnet import resnet_retinanet
x = Input(shape=(416, 416, 3))
model = resnet_retinanet(len(classes), inputs=x)
_, _, boxes, _ = model.predict_on_batch(inputs)
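Detectors like these generally return a variable number of boxes per image, so you would still need a step that brings the result to a fixed 10 x 4 array. Below is a minimal sketch of one way to do that, assuming you already have per-image lists of boxes and confidence scores; `to_fixed_boxes` is a hypothetical helper of mine, and the actual output format of either library may differ.

```python
def to_fixed_boxes(boxes, scores, k=10, pad_value=0.0):
    """Keep the k highest-scoring boxes and pad to exactly k rows of 4 values."""
    # Sort detection indices by score, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = [list(boxes[i]) for i in order[:k]]
    # Pad with dummy rows when fewer than k detections are available.
    while len(kept) < k:
        kept.append([pad_value] * 4)
    return kept


# Made-up detections for illustration only.
detections = [[10, 20, 50, 80], [5, 5, 30, 30], [100, 120, 200, 220]]
confidences = [0.4, 0.9, 0.7]

fixed = to_fixed_boxes(detections, confidences)
print(len(fixed), len(fixed[0]))  # 10 4
print(fixed[0])                   # [5, 5, 30, 30]
```

Padding with zeros is just one convention; whatever you choose, the labels would need to follow the same one.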