I had a quick look on the forums and I don't think this question has been asked already.
I am currently working with an MPI/CUDA hybrid code, made by somebody else
Another option: since you are already using TAU to profile the CPU side of the application, you could also use TAU to collect the GPU performance data. TAU supports multi-GPU execution along with MPI; take a look at http://www.nic.uoregon.edu/tau-wiki/Guide:TAUGPU for instructions on how to get started with TAU's GPU profiling capabilities. TAU uses CUPTI (CUDA Profiling Tools Interface) underneath, so the data you will be able to collect with TAU will be very similar to what you can collect with NVIDIA's Visual Profiler.
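To give a rough idea of what a launch might look like, here is a minimal sketch. It assumes your TAU install was configured with both MPI and CUPTI support, so that the `-T mpi,cupti` tags match an existing configuration; yours may well differ. The rank count and the executable name `./hybrid_app` are placeholders I made up, not anything from your code. By default the script only prints the command, so you can check it before running anything.

```shell
#!/bin/sh
# Sketch only: NP and APP are hypothetical placeholders for your own job.
NP="${NP:-4}"                 # number of MPI ranks (assumed value)
APP="${APP:-./hybrid_app}"    # your MPI/CUDA executable (hypothetical name)

# tau_exec wraps each rank; -cupti turns on GPU measurement through CUPTI.
# The -T tags are an assumption and must match a TAU configuration
# that actually exists in your install.
CMD="mpirun -np $NP tau_exec -T mpi,cupti -cupti $APP"

if [ "${RUN:-0}" = "1" ]; then
    $CMD          # actually launch the profiled run
else
    echo "$CMD"   # default: dry run, just print the command
fi
```

Set `RUN=1` to launch for real. Afterwards the per-rank profiles should be readable with the same tools you already use for the CPU side, such as `pprof` or `paraprof`.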