Is there a better way to perform simple scalar operations on the device than repeatedly launching tiny kernels? I am trying to fully pipeline a set of vector routin