I\'m trying to write neural networks from scratch, and since datasets are huge, I want my 4 cores to process the batches in parallel. But the final execution time for multip