When a modern accelerator such as a GPU executes a forward pass of a neural network, does it execute layer by layer? That is, does it finish ALL the work of the previous layer, then star