You are right that the last convolutional layer has 256 x 13 x 13 = 43264 neurons. However, it is followed by a max-pooling layer with pool_size = 3 and stride = 2, which produces an output of size 256 x 6 x 6 (since (13 - 3)/2 + 1 = 6). You connect this to a fully-connected layer. In order to do that, you first have to flatten the output into a vector of shape 256 x 6 x 6 = 9216 x 1.
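To make the shape arithmetic concrete, here is a minimal sketch (the helper name `pool_out` is mine, not from any framework) that applies the standard no-padding output-size formula:

```python
def pool_out(size, pool, stride):
    # Standard output-size formula for pooling (no padding):
    # floor((input - pool) / stride) + 1
    return (size - pool) // stride + 1

h = pool_out(13, pool=3, stride=2)
flat = 256 * h * h  # channels * height * width after flattening

print(h)     # spatial size after max-pooling: 6
print(flat)  # length of the flattened vector: 9216
```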
To map 9216 neurons to 4096 neurons, we introduce a 9216 x 4096 weight matrix as the weight of the dense/fully-connected layer. Therefore, w^T * x = [9216 x 4096]^T * [9216 x 1] = [4096 x 1]. In short, each of the 9216 neurons is connected to all 4096 neurons, which is why the layer is called a dense or fully-connected layer.
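The matrix-vector product above can be checked directly with NumPy (random values stand in for learned weights; this only verifies the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((9216, 4096))  # dense-layer weight matrix
x = rng.standard_normal((9216, 1))     # flattened conv/pool output

y = W.T @ x  # [4096 x 9216] @ [9216 x 1]
print(y.shape)  # (4096, 1)
```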
As others have said above, there is no hard rule about why this should be 4096. The dense layer just has to have enough neurons to capture the variability of the entire dataset. The dataset under consideration, ImageNet-1K, is quite difficult and has 1000 categories, so 4096 neurons does not seem like too many to start with.