I'm currently writing a custom DNN kernel to run on ARM-based devices (using C++, with GCC 7.5.0).
My code slices the given workload into batches that would fit in t