I have an input image of 416x416. How can I create an output of 4 x 10, where 4 is the number of columns and 10 the number of rows?
My label data is a 2D array with 4 columns and 10 rows.
First flatten the (None, 13, 13, 1024) layer:

model.add(Flatten())

This gives a 1-dimensional tensor of 13 * 13 * 1024 = 173056 values.

Then add a dense layer:

model.add(Dense(4 * 10))

This outputs 40 values, i.e. it transforms your 3D feature map into a 1D vector.

Then simply reshape it to the form you need:

model.add(Reshape((10, 4)))
This will work, but it will absolutely destroy the spatial nature of your data.
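For reference, here's a minimal sketch of that approach end to end. The InputLayer is just a stand-in for the output of your convolutional backbone (the layer producing the (13, 13, 1024) feature map); substitute your own layers there.

from keras.models import Sequential
from keras.layers import InputLayer, Flatten, Dense, Reshape

model = Sequential()
# Stand-in for your convolutional backbone, which ends in a (13, 13, 1024) feature map
model.add(InputLayer(input_shape=(13, 13, 1024)))
model.add(Flatten())          # 13 * 13 * 1024 = 173056 values
model.add(Dense(10 * 4))      # 40 values
model.add(Reshape((10, 4)))   # 10 rows x 4 columns

model.summary()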
I believe the easiest way to make your predictions conform to the desired output shape is the solution proposed by @Darlyn. Assuming the network you have so far (which outputs tensors of shape (13, 13, 1024)) was declared like this:
from keras.layers import Input, Conv2D

x = Input(shape=(416, 416, 3))
y = Conv2D(32, (3, 3), activation='relu')(x)
...
y = Conv2D(1024, (3, 3), activation='relu')(y)
You just need to add a regression layer that tries to predict the boxes, and then reshape its output to (10, 4):
import numpy as np

from keras.layers import Flatten, Dense, Reshape
from keras.models import Model

samples = 1
boxes = 10

y = Flatten(name='flatten')(y)
y = Dense(boxes * 4, activation='relu')(y)
y = Reshape((boxes, 4), name='predictions')(y)
model = Model(inputs=x, outputs=y)
x_train = np.random.randn(samples, 416, 416, 3)
p = model.predict(x_train)
print(p.shape)
# (1, 10, 4)
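If your labels are already arranged as an array of shape (samples, 10, 4), you can train this model directly. A rough sketch, where the mse loss and the random y_train are just placeholders for your actual choices and data:

# y_train stands in for your real label array of shape (samples, 10, 4)
y_train = np.random.randn(samples, 10, 4)

model.compile(optimizer='adam', loss='mse')
model.fit(x_train, y_train, epochs=1, batch_size=1)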
This works, but I'm not entirely sure that directly regressing these values will produce good results. I usually see object-detection models using attention, region proposals or saliency maps to determine the position of objects. There are a couple of object-detection Keras implementations you could try:
import keras_rcnn.models.backbone

classes = ["dog", "cat", "hooman"]
backbone = keras_rcnn.models.backbone.VGG16
model = keras_rcnn.models.RCNN((416, 416, 3), classes, backbone)
boxes, predictions = model.predict(x_train)  # x_train: a batch of input images
from keras_retinanet.models.resnet import resnet_retinanet

x = Input(shape=(416, 416, 3))
model = resnet_retinanet(len(classes), inputs=x)
_, _, boxes, _ = model.predict_on_batch(x_train)