I am working with CUDA on the Windows platform, where we have access to both Parallel Nsight and Visual Profiler. Both are pretty good, but each has its own strengths.
Parallel Nsight has the benefit of being built right into Visual Studio and features a natural workflow for Windows developers.
In Parallel Nsight 2.2, whenever the target is set to "localhost", the Monitor is started automatically. This is true for both Analysis and CUDA profiling as well as CUDA debugging.
The Monitor takes a short time to start up (roughly the same time it takes to start your favorite web browser), but this is a one-time cost. Until the Monitor is terminated or the machine is restarted, there is no need to start the Monitor again.
EDIT (change of mind): Based on reevaluating both NVIDIA Parallel Nsight and Visual Profiler, I now find NVIDIA Parallel Nsight much better for performance analysis.
The reasons are explained further in @Jeff Davis's answer.
Nsight Visual Studio Edition 2.2 offers the following advantages over the Visual Profiler:
OVERALL
Integration into Visual Studio 2008 SP1 and 2010 (requires Professional Edition as VS Express Edition does not support integration packages).
Local and remote analysis sessions. Remote sessions can also be configured to copy the application and resources to the remote system.
Collect information from a target application or from a process tree.
Report views support more advanced grouping and filtering. Data tables can be exported to Excel.
TRACE ACTIVITY
Trace OS activity including process, thread, and module lifetime, thread context switching, thread wait reasons, CPU utilization, process CPU utilization, and thread utilization.
Collect API and GPU work trace for CUDA, OpenGL 2.x-3.x, DirectX 9-11, and OpenCL 1.1 and show all information on the timeline.
Collection of call stack traces on all traced API calls or only when traced API calls return errors.
CUDA software counters to show allocated memory per context.
Additional control over what information is traced. This is critical as tracing too much information can cause the application to become CPU bound.
Timeline and tree display for user annotations from NVIDIA Tools Extensions Library and D3D Performance Markers.
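As a rough illustration of the user annotations mentioned above, the sketch below wraps the NVTX range calls (`nvtxRangePushA`, `nvtxRangePop`) and a marker (`nvtxMarkA`) in a small RAII helper. `ScopedRange`, `sum_below`, the `USE_NVTX` macro, and the stub functions are names made up for this example; the stubs exist only so the code builds without the CUDA toolkit installed.

```cpp
#include <cassert>

#ifdef USE_NVTX
#include <nvToolsExt.h>  // NVTX C API from the CUDA toolkit; link against the NVTX library
#else
// Assumption: stand-ins with the same signatures as the NVTX calls, so the
// sketch builds without the toolkit. They only track the nesting depth.
static int g_stub_depth = 0;
static inline int  nvtxRangePushA(const char*) { return g_stub_depth++; }
static inline int  nvtxRangePop()              { return --g_stub_depth; }
static inline void nvtxMarkA(const char*)      {}
#endif

// Hypothetical RAII helper: opens a named range on construction and closes
// it when the object goes out of scope, so push and pop always stay paired.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
        assert(name != nullptr);
        nvtxRangePushA(name);
    }
    ~ScopedRange() { nvtxRangePop(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Hypothetical CPU workload annotated with one range and one marker.
static inline long long sum_below(int n) {
    ScopedRange range("sum_below");          // range covers the whole function
    long long total = 0;
    for (int i = 0; i < n; ++i) total += i;
    nvtxMarkA("sum_below: loop finished");   // instantaneous marker
    return total;
}
```

Calling `sum_below(1000)` from your application with `USE_NVTX` defined should produce one range and one marker in the tool's timeline, provided user annotation tracing is enabled for the session.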
CUDA PROFILING ACTIVITY
The CUDA profiler provides a method to capture your kernel and replay it many times, transparently to your application. This allows collection of profiling data in non-deterministic applications with only a single launch of your application. The Visual Profiler (version 5 and earlier) requires the application to be deterministic so that it can relaunch the application many times.
Supports collection of many useful metrics not yet supported by the Visual Profiler, including warps eligible, which is the most critical metric for understanding whether you have sufficient occupancy, and warp stall reasons, which help you understand what is limiting the performance of the application.
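To make the kernel replay idea more concrete, here is a purely conceptual host-side sketch. It is not how Nsight implements replay; `Kernel` and `replay_kernel` are hypothetical names, and a plain `std::vector<float>` stands in for device memory. The point is that the state is restored before every pass, so the application only ever sees the effect of one launch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a kernel that updates its buffer in place.
using Kernel = std::function<void(std::vector<float>&)>;

// Conceptual replay loop: snapshot the buffer once, then for every pass
// restore the snapshot and run the kernel again. A real profiler would
// collect a different set of counters on each pass.
static inline std::size_t replay_kernel(const Kernel& kernel,
                                        std::vector<float>& buffer,
                                        std::size_t passes) {
    assert(passes > 0);
    const std::vector<float> snapshot = buffer;  // pre-launch state
    std::size_t launches = 0;
    for (std::size_t pass = 0; pass < passes; ++pass) {
        buffer = snapshot;   // roll back side effects of the previous pass
        kernel(buffer);      // re-execute the same work
        ++launches;
    }
    return launches;         // buffer now holds the result of a single launch
}
```

For example, replaying a kernel that adds 1 to each element over five passes leaves every element incremented exactly once, which is why the application does not need to be deterministic or relaunched.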
The Visual Profiler has the following advantages:
Cross platform.
Provides an expert system to review the collected information.
Links in the results to the CUDA Best Practices Guide.
Timeline can show correlation between CPU and GPU events when you click on an event.
CUDA 5.0 supports a new command-line profiler (nvprof).
CUDA 5.0 supports source correlation for branch divergence and memory access with bad access patterns.
CUDA 5.0 profiler is integrated into Nsight Eclipse Edition.
Better support for Tesla PM counters.
Visual Profiler in CUDA 5.0 adds a number of the features available in Nsight 1.5 and 2.x, including:
NVIDIA Tools Extension Library for annotating your application with ranges and markers that can be displayed in the timeline.
Concurrent kernel trace on Fermi and Kepler GPUs.
Both tools will provide you with very helpful information for analyzing your application. I recommend that you use the latest version of each tool.
The upcoming version of Nsight VSE will have many new features for investigating the execution of your CUDA kernel. For more information see http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0430-GTC2012-Developing-CUDA-Nsight.pdf.