Fast(est) way to write a sequence of integers to global memory?

灰色年华 2021-02-05 22:03

The task is very simple: writing a sequence of integer values to memory:

Original code:

for (size_t i = 0; i < 1000*1000*1000; ++i)
{
    data[i] = i;
}


        
2 Answers
  • 2021-02-05 22:30

    Is there any reason why you would expect all of data[] to be in powered-up RAM pages?

    The DDR3 prefetcher will correctly predict most accesses, but the frequent x86-64 page boundaries might be an issue. You're writing to virtual memory, so at each page boundary there's a potential mis-prediction of the prefetcher. You can greatly reduce this by using large pages (e.g. MEM_LARGE_PAGES on Windows).
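
    For example, a large-page allocation on Windows might look like this minimal sketch (assuming x86-64; MEM_LARGE_PAGES needs the SeLockMemoryPrivilege enabled for the process, and the request must be a multiple of the large-page granularity):

    #include <windows.h>
    #include <cstdint>

    // Minimum large-page size, typically 2 MiB on x86-64.
    SIZE_T page = GetLargePageMinimum();

    // Round the buffer size up to a multiple of the large-page size.
    SIZE_T bytes = 1000ULL * 1000 * 1000 * sizeof(uint64_t);
    bytes = (bytes + page - 1) & ~(page - 1);

    // Returns NULL if the privilege is missing or contiguous memory is unavailable.
    uint64_t *data = static_cast<uint64_t *>(
        VirtualAlloc(nullptr, bytes,
                     MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                     PAGE_READWRITE));

    With 2 MiB pages there is one page boundary every 2 MiB of the buffer instead of every 4 KiB, so the prefetcher has far fewer boundaries to mis-predict.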

  • 2021-02-05 22:49

    Assuming this is x86, and that you are not already saturating your available DRAM bandwidth, you can try using SSE2 or AVX2 to write 2 or 4 64-bit elements at a time:

    SSE2:

    #include "emmintrin.h"
    
    const __m128i v2 = _mm_set1_epi64x(2);    // add 2 to each 64-bit lane per iteration
    __m128i v = _mm_set_epi64x(1, 0);         // lanes start as {0, 1}
    
    for (size_t i = 0; i < 1000*1000*1000; i += 2)
    {
        _mm_stream_si128((__m128i *)&data[i], v);  // non-temporal store, bypasses the cache
        v = _mm_add_epi64(v, v2);                  // advance to {i+2, i+3}
    }
    

    AVX2:

    #include "immintrin.h"
    
    const __m256i v4 = _mm256_set1_epi64x(4);    // add 4 to each 64-bit lane per iteration
    __m256i v = _mm256_set_epi64x(3, 2, 1, 0);   // lanes start as {0, 1, 2, 3}
    
    for (size_t i = 0; i < 1000*1000*1000; i += 4)
    {
        _mm256_stream_si256((__m256i *)&data[i], v);  // non-temporal store, bypasses the cache
        v = _mm256_add_epi64(v, v4);                  // advance to {i+4, ..., i+7}
    }
    

    Note that data needs to be suitably aligned (a 16-byte boundary for the SSE2 version, a 32-byte boundary for AVX2); one way to get that is sketched below.
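
    A minimal sketch of such an allocation, assuming a C++17 toolchain with std::aligned_alloc (on MSVC you would use _aligned_malloc/_aligned_free instead):

    #include <cstdint>
    #include <cstdlib>

    const size_t N = 1000*1000*1000;

    // 32-byte alignment satisfies both the SSE2 (16-byte) and AVX2 (32-byte) loops.
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // which N * sizeof(uint64_t) is here.
    uint64_t *data = static_cast<uint64_t *>(
        std::aligned_alloc(32, N * sizeof(uint64_t)));

    // ... run one of the loops above ...

    std::free(data);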

    AVX2 is only available on Intel Haswell and later, but SSE2 is pretty much universal these days.


    FWIW I put together a test harness with a scalar loop and the above SSE and AVX loops, compiled it with clang, and tested it on a Haswell MacBook Air (1600 MHz LPDDR3 DRAM). I got the following results:

    # sequence_scalar: t = 0.870903 s = 8.76033 GB / s
    # sequence_SSE: t = 0.429768 s = 17.7524 GB / s
    # sequence_AVX: t = 0.431182 s = 17.6941 GB / s
    

    I also tried it on a Linux desktop PC with a 3.6 GHz Haswell, compiling with gcc 4.7.2, and got the following:

    # sequence_scalar: t = 0.816692 s = 9.34183 GB / s
    # sequence_SSE: t = 0.39286 s = 19.4201 GB / s
    # sequence_AVX: t = 0.392545 s = 19.4357 GB / s
    

    So it looks like the SIMD implementations give a 2x or more improvement over 64-bit scalar code (although 256-bit SIMD doesn't seem to give any improvement over 128-bit SIMD), and that typical throughput should be a lot faster than 5 GB / s.

    My guess is that there is something wrong with the OP's system or benchmarking code that is resulting in the apparently reduced throughput.
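
    For reference, the timing side of such a harness might look something like this minimal sketch (not the exact code used for the numbers above; bench and the fill-function signatures are illustrative):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Time one fill function over n 64-bit elements and report throughput.
    template <typename Fill>
    void bench(const char *name, Fill fill, uint64_t *data, size_t n)
    {
        auto t0 = std::chrono::steady_clock::now();
        fill(data, n);    // e.g. sequence_scalar / sequence_SSE / sequence_AVX
        auto t1 = std::chrono::steady_clock::now();

        double t = std::chrono::duration<double>(t1 - t0).count();
        double rate = n * sizeof(uint64_t) / t / 1.0e9;
        printf("# %s: t = %g s = %g GB / s\n", name, t, rate);
    }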
