vtune

Is it possible to use vtune on certain code snippets in a binary and not an entire binary?

[亡魂溺海] posted on 2019-12-09 23:01:11
Question: I am adding usage of a small library to a large existing piece of software and would like to analyze (in finer detail than just in-and-out rdtsc() or gettimeofday calls) the overhead of the small library and its attribution. Using things like rdtsc() I can get a sense of the latency of calls to my library's functions, but I cannot do latency attribution unless I can also see whether branches are being mispredicted, whether caching isn't working properly, etc. I looked into PAPI as I

Optimizing SSE-code

旧时模样 posted on 2019-12-08 22:19:26
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately, my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation. The inner

MKL Performance on Intel Phi

本小妞迷上赌 posted on 2019-12-08 19:47:46
Question: I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:

    double doModelFit(int model, ...) {
        ...
        while( !done ) {
            cblas_dgemm(...);
            cblas_dgemm(...);
            ...
            dgesv(...);
            ...
        }
        return result;
    }

    int main(int argc, char **argv) {
        ...
        c_start = 1;
        c_stop = nmodel;
        for(int c=c_start; c<c_stop; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }

Call the above version 1. Since the models are independent

How to improve memory performance/data locality of 64-bit C/intel assembly program

拜拜、爱过 posted on 2019-12-08 06:26:20
Question: I am using a hobby program to teach myself high-performance computing techniques. My PC has an Intel Ivy Bridge Core i7 3770 processor with 32 GB of memory, and I use the free version of the Microsoft VS2010 C compiler. The 64-bit program needs about 20 GB of memory because it has five 4 GB lookup tables (bytevecM ... bytevecX below). The inner loop of this search program was written as a separate C file (since I may want to replace it later with an assembler version), shown below: #define H_PRIME

How do I generate symbol information to use with Linux version of Intel's VTune Amplifier?

删除回忆录丶 posted on 2019-12-08 00:51:08
Question: I am using Intel VTune Amplifier XE 2011 to analyze the performance of my program. I want to be able to view the source code in the analysis results, and the documentation says I need to provide the symbol information. Unfortunately, it does not state how to generate that symbol information when compiling my program. In the Windows version of VTune all I had to do was provide the ".pdb" file that Microsoft Visual Studio would generate. Is there a similar kind of file I can create using g++ to

Profiling help required

心已入冬 posted on 2019-12-07 00:41:00
Question: I have a profiling issue. Imagine I have the following code:

    void main() {
        well_written_function();
        badly_written_function();
    }
    void well_written_function() {
        for (a small number) {
            highly_optimised_subroutine();
        }
    }
    void badly_written_function() {
        for (a wastefully and unnecessarily large number) {
            highly_optimised_subroutine();
        }
    }
    void highly_optimised_subroutine() {
        // lots of code
    }

If I run this under vtune (or other profilers) it is very hard to spot that anything is wrong. All the

How do I generate symbol information to use with Linux version of Intel's VTune Amplifier?

守給你的承諾、 posted on 2019-12-06 12:21:01
I am using Intel VTune Amplifier XE 2011 to analyze the performance of my program. I want to be able to view the source code in the analysis results, and the documentation says I need to provide the symbol information. Unfortunately, it does not state how to generate that symbol information when compiling my program. In the Windows version of VTune all I had to do was provide the ".pdb" file that Microsoft Visual Studio would generate. Is there a similar kind of file I can create using g++ to provide this symbol information?

JimR: gcc -g <your stuff> should be all that's necessary. However I

Why does g++ (4.6 and 4.7) promote the result of this division to a double? Can I stop it?

↘锁芯ラ posted on 2019-12-05 13:59:24
I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation. I discovered that my floating-point code was slower, and after investigating with Intel's VTune Amplifier I found that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and then back again. The performance degradation is almost 10% for this application. After compiling with the flag -Wdouble-promotion (which BTW is not included with -Wall or -Wextra)
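The excerpt is cut off before the offending expression is shown, so the following is only a sketch of one common way a float computation ends up promoted to double: mixing a `float` operand with an unsuffixed (double) literal. The function names `half_promoting` and `half_same_type` are hypothetical and do not come from the question.

```cpp
#include <type_traits>

// Hypothetical helper: the literal 2.0 has type double, so when T is float
// the division is performed in double and converted back to float on return.
// This is the kind of expression -Wdouble-promotion is meant to flag.
template <typename T>
T half_promoting(T x) {
    return x / 2.0;
}

// Same computation with the constant converted to T first, so the whole
// expression stays in T and no float <-> double conversions are needed.
template <typename T>
T half_same_type(T x) {
    return x / static_cast<T>(2);
}

// The language rule behind the difference, checked at compile time.
static_assert(std::is_same<decltype(1.0f / 2.0), double>::value,
              "float divided by a double literal yields double");
static_assert(std::is_same<decltype(1.0f / 2.0f), float>::value,
              "float divided by a float literal stays float");
```

Both helpers return the same value for inputs such as `3.0f`; the difference is only in the type used for the intermediate result, which is what shows up as extra conversion instructions in a profile.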

Profiling help required

拜拜、爱过 posted on 2019-12-05 04:28:22
I have a profiling issue. Imagine I have the following code:

    void main() {
        well_written_function();
        badly_written_function();
    }
    void well_written_function() {
        for (a small number) {
            highly_optimised_subroutine();
        }
    }
    void badly_written_function() {
        for (a wastefully and unnecessarily large number) {
            highly_optimised_subroutine();
        }
    }
    void highly_optimised_subroutine() {
        // lots of code
    }

If I run this under vtune (or other profilers) it is very hard to spot that anything is wrong. All the hotspots will appear in the section marked "// lots of code" which is already optimised. The badly
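One way to see the problem described here is to attribute calls to their callers rather than looking only at where time is spent. The sketch below is a hand-rolled illustration, not a VTune feature: the `call_counts` map, the extra `caller` parameter, and the loop bounds (10 and 100000) are all assumptions added for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical instrumentation: number of calls into the hot routine,
// keyed by the name of the function that made the call.
std::map<std::string, std::size_t>& call_counts() {
    static std::map<std::string, std::size_t> counts;
    return counts;
}

void highly_optimised_subroutine(const char* caller) {
    ++call_counts()[caller];
    // ... lots of code ...
}

void well_written_function() {
    // Assumed "small number" of iterations.
    for (int i = 0; i < 10; ++i) {
        highly_optimised_subroutine("well_written_function");
    }
}

void badly_written_function() {
    // Assumed "unnecessarily large number" of iterations.
    for (int i = 0; i < 100000; ++i) {
        highly_optimised_subroutine("badly_written_function");
    }
}
```

After running both functions, the per-caller counts make the imbalance obvious even though every sample lands inside the subroutine. A profiler view that groups time by caller or call stack gives similar information without modifying the code.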

Is it possible to use vtune on certain code snippets in a binary and not an entire binary?

北城余情 posted on 2019-12-04 17:20:53
I am adding usage of a small library to a large existing piece of software and would like to analyze (in finer detail than just in-and-out rdtsc() or gettimeofday calls) the overhead of the small library and its attribution. Using things like rdtsc() I can get a sense of the latency of calls to my library's functions, but I cannot do latency attribution unless I can also see whether branches are being mispredicted, whether caching isn't working properly, etc. I looked into PAPI as I imagined looking at certain hardware events going into and out of a routine in my library within the
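The coarse "in and out" timing that the question treats as its baseline can be sketched as a scoped timer wrapped around each library call. This is an illustration under assumptions: it uses `std::chrono::steady_clock` in place of `rdtsc()` for portability, and `RegionTimer`, `ScopedRegion`, `library_call`, and `timed_library_call` are hypothetical names. As the question notes, this gives latency only, with no insight into branch prediction or cache behaviour.

```cpp
#include <chrono>
#include <cstdint>

// Accumulated wall-clock time and call count for one region of code.
struct RegionTimer {
    std::uint64_t total_ns = 0;
    std::uint64_t calls = 0;
};

// RAII guard: records the start time on construction and adds the elapsed
// time to the RegionTimer when it goes out of scope.
class ScopedRegion {
public:
    explicit ScopedRegion(RegionTimer& timer)
        : timer_(timer), start_(std::chrono::steady_clock::now()) {}

    ~ScopedRegion() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        timer_.total_ns += static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
        ++timer_.calls;
    }

    ScopedRegion(const ScopedRegion&) = delete;
    ScopedRegion& operator=(const ScopedRegion&) = delete;

private:
    RegionTimer& timer_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical stand-in for a function exported by the small library.
int library_call(int x) {
    return x * 2 + 1;
}

// Wrapper used at call sites: only the library call itself is timed.
int timed_library_call(RegionTimer& timer, int x) {
    ScopedRegion guard(timer);
    return library_call(x);
}
```

Dividing `total_ns` by `calls` gives a mean latency per call. Restricting measurement to the wrapper keeps the rest of the large application out of the numbers, which is the same scoping the question is asking a profiler to provide for hardware events.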