I have a program which I'm compiling like this:
(...) Some ifort *.f -c
nvcc -c src/bicgstab.cu -o bicgstab.o -I/home/ricardo/apps/cusp/cusplibrary
(...) Some m
ipo gets fairly complicated when there are multiple files involved: it effectively reruns the compiler on all modules at link time. I'm not an expert on this, but that sounds like something fairly difficult to wade through.
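Roughly, when you compile with -ipo (which -fast turns on), the object files hold intermediate code rather than final machine code, and the heavy lifting happens when ifort links them. With a couple of made-up file names:

$ ifort -c -ipo a.f90 b.f90
$ ifort -ipo a.o b.o -o prog    # cross-module compilation/optimization happens here

That second step is where things get tangled when some of the objects in the link didn't come from ifort.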
One possible option is to compile your CUDA code into a shared library (.so) and link against that. That should keep the Intel compiler toolchain from trying to recompile and re-optimize the code generated by nvcc/gcc. I think this will limit you to "single-file optimizations"; I don't know whether that will significantly affect your performance.
Using my example here, I would modify the compile commands as follows:
$ nvcc -Xcompiler="-fPIC" -shared bicgstab.cu -o bicgstab.so -I/home-2/robertc/misc/cusp/cusplibrary-master
$ ifort -c -fast bic.f90
$ ifort bic.o bicgstab.so -L/shared/apps/cuda/CUDA-v6.0.37/lib64 -lcudart -o program
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /tmp/ipo_ifortxEdpin.o
$
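For reference, the nvcc step above assumes bicgstab.cu exposes a C-callable entry point that the Fortran code can link to. I don't have your source, so the routine below is just a made-up placeholder showing the shape of that interface, not your actual bicgstab code:

// bicgstab.cu -- hypothetical interface; your real routine and arguments will differ
#include <cuda_runtime.h>

__global__ void scale(double *x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// extern "C" stops C++ name mangling; the trailing underscore matches ifort's
// default external-name convention, so the Fortran side can simply
// "call gpu_scale(x, a, n)" (or declare it with bind(C) and drop the underscore)
extern "C" void gpu_scale_(double *x, double *a, int *n)
{
    double *d_x;
    cudaMalloc(&d_x, *n * sizeof(double));
    cudaMemcpy(d_x, x, *n * sizeof(double), cudaMemcpyHostToDevice);
    scale<<<(*n + 255) / 256, 256>>>(d_x, *a, *n);
    cudaMemcpy(x, d_x, *n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}

Note that Fortran passes arguments by reference, which is why everything arrives as a pointer.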
You don't indicate where in your compile process you are adding the -fast switch(es). If it's only on the ifort compile commands, I believe the above approach will work. If you also want/need it on the link command, then ifort will want to build an entirely statically linked executable (and do inter-module optimization...), which won't work with the above process.
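Mapped onto your build, with -fast kept only on the ifort compile step, that would look something like the following (the CUDA library path is a placeholder for whatever you're currently linking against):

$ ifort -c -fast *.f
$ nvcc -Xcompiler="-fPIC" -shared src/bicgstab.cu -o bicgstab.so -I/home/ricardo/apps/cusp/cusplibrary
$ ifort *.o bicgstab.so -L<your cuda install>/lib64 -lcudart -o program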