I want to implement a conditional variational autoencoder (CVAE) for MNIST images using convolutional layers, without flattening the input. So each batch of images has shape [batch_size, 28, 28, 1].
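For illustration, one common way to condition a convolutional encoder without flattening is to broadcast the one-hot label across the spatial dimensions and append it as extra channels. The sketch below is only an assumption about how this could be done, not something stated in the question: the helper name `concat_label_channels`, the `num_classes=10` default, and the use of plain NumPy (rather than any particular deep learning framework) are all hypothetical choices.

```python
import numpy as np

def concat_label_channels(images, labels, num_classes=10):
    """Append a one-hot label to each image as constant feature maps.

    images: float array of shape [batch_size, height, width, channels]
    labels: integer array of shape [batch_size]
    Returns an array of shape [batch_size, height, width, channels + num_classes].
    (Hypothetical helper, for illustration only.)
    """
    batch_size, height, width, _ = images.shape
    # One-hot encode the labels: [batch_size, num_classes]
    one_hot = np.eye(num_classes, dtype=images.dtype)[labels]
    # Repeat each one-hot vector at every pixel: [batch_size, height, width, num_classes]
    label_maps = np.broadcast_to(
        one_hot[:, None, None, :], (batch_size, height, width, num_classes)
    )
    # Stack the label maps behind the image channels
    return np.concatenate([images, label_maps], axis=-1)

# Dummy MNIST-shaped batch
images = np.zeros((4, 28, 28, 1), dtype=np.float32)
labels = np.array([3, 0, 7, 1])
conditioned = concat_label_channels(images, labels)
print(conditioned.shape)  # (4, 28, 28, 11)
```

The result keeps the [batch_size, 28, 28, ...] layout, so it can be fed to convolutional layers directly; only the channel count changes from 1 to 11.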