Issue with __m256 type of Intel intrinsics


I'm trying to test some of the Intel intrinsics to see how they work. So I created a function to do that for me, and this is the code:

void test_intel_256()


                      
2 Answers

感动是毒, 2021-01-13 01:56

MMX and SSE2 are baseline for x86-64, but AVX is not.  You do need to specifically enable AVX, where you didn't for SSE2.

Build with -march=haswell or whatever CPU you actually have.  Or just use -mavx.
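
As a rough illustration of what the flag changes, here is a minimal sketch of a test function. The name add8 is hypothetical and not from the question. GCC and clang define the __AVX__ macro only when AVX is enabled (for example by -mavx or -march=haswell), so the sketch uses it to fall back to scalar code when AVX is off.

```c
#include <immintrin.h>  /* one header for all SSE/AVX intrinsics */

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 floats.
 * The AVX path is compiled only when the compiler defines __AVX__,
 * e.g. when building with: gcc -O2 -mavx file.c */
void add8(const float *a, const float *b, float *out)
{
#ifdef __AVX__
    __m256 va = _mm256_loadu_ps(a);               /* unaligned 256-bit load */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb)); /* add and store 8 floats */
#else
    for (int i = 0; i < 8; ++i)                   /* scalar fallback, AVX not enabled */
        out[i] = a[i] + b[i];
#endif
}
```

Either path produces the same sums. Only the build flags decide which one is compiled.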

Beware that gcc -mavx with the default tune=generic will split 256-bit loadu/storeu intrinsics into vmovups xmm / vinsertf128, which is bad if your data is actually aligned most of the time, and especially bad on Haswell with its limited shuffle-port throughput.

It's good for Sandybridge and Bulldozer-family if your data really is unaligned, though.  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568: it even affects AVX2 vector-integer code, even though all AVX2 CPUs (except maybe Excavator and Ryzen) are harmed by this tuning.  tune=generic doesn't take into account which instruction-set extensions are enabled, and there's no tune=generic-avx2.

You could use -mavx2 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store.  That still doesn't enable other tuning options, such as optimizing for macro-fusion of compare and branch, that would help all modern x86 CPUs except low-power ones but that gcc's tune=generic leaves off (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78855).
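
For context, the splitting discussed above applies to code like the following sketch, which walks an array with the loadu/storeu intrinsics. The function name scale_floats is hypothetical. With the two -mno-avx256-split-unaligned-* flags, each intrinsic should compile to a single 256-bit vmovups instead of a 128-bit load plus vinsertf128. The __AVX__ guard is an assumption added here so the sketch still builds when AVX is off.

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: multiply n floats in place by factor.
 * Possible build line, using the flags from the answer:
 *   gcc -O2 -mavx2 -mno-avx256-split-unaligned-load
 *       -mno-avx256-split-unaligned-store file.c */
void scale_floats(float *data, size_t n, float factor)
{
    size_t i = 0;
#ifdef __AVX__
    const __m256 vf = _mm256_set1_ps(factor);     /* broadcast factor to all 8 lanes */
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);     /* valid for any alignment */
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vf));
    }
#endif
    for (; i < n; ++i)                            /* tail, or all elements without AVX */
        data[i] *= factor;
}
```

The loop uses loadu/storeu because the function cannot assume its caller passes a 32-byte-aligned pointer.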



Also:


  I'm including these libraries mmintrin.h, emmintrin.h, xmmintrin.h


Don't do that.  Always just include immintrin.h in SIMD code.  It pulls in all Intel SSE/AVX extensions.  Including only those older headers, none of which declares the AVX types, is why you get error: unknown type name ‘__m256’



Keep in mind that subscripting vector types like __m256 is non-standard and non-portable.  They're not arrays, and there's no reason you should expect [] to work like an array.  Extracting the 3rd element or something from a SIMD vector in a register requires a shuffle instruction, not a load.
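
Two common ways to read one element without subscripting are sketched below. Both function names are hypothetical. The first uses a shuffle to bring element 2 down to lane 0. The second stores the whole vector to a temporary array and indexes that. The scalar fallbacks under #else are an assumption added so the sketch also builds without -mavx.

```c
#include <immintrin.h>

/* Hypothetical helper: element 2 of the 8 floats at src, extracted with a
 * shuffle so the value does not round-trip through memory. */
float third_element(const float *src)
{
#ifdef __AVX__
    __m256 v  = _mm256_loadu_ps(src);
    __m128 lo = _mm256_castps256_ps128(v);    /* low four floats; a cast, not a shuffle */
    __m128 e2 = _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(2, 2, 2, 2)); /* element 2 in every lane */
    return _mm_cvtss_f32(e2);                 /* read lane 0 as a scalar */
#else
    return src[2];                            /* scalar fallback, AVX not enabled */
#endif
}

/* Hypothetical helper: element i (0..7) by storing the vector to a
 * temporary array and indexing the array. */
float element_via_store(const float *src, int i)
{
#ifdef __AVX__
    float tmp[8];
    _mm256_storeu_ps(tmp, _mm256_loadu_ps(src));
    return tmp[i];
#else
    return src[i];                            /* scalar fallback, AVX not enabled */
#endif
}
```

The store-based version handles a run-time index. The shuffle version needs the index as a compile-time constant.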



If you want handy wrappers for vector types that let you do stuff like use operator[] to extract scalars from elements of vector variables, have a look at Agner Fog's Vector Class Library.  It's GPLed, so you'll have to look at other wrapper libraries if that's a problem.

It lets you do stuff like

// example from the manual for operator[]
Vec4i a(10,11,12,13);
int b = a[2];   // b = 12


You can use normal intrinsics on VCL types.  Vec8f is a transparent wrapper on __m256, so you can use it with _mm256_mul_ps.
                                                                        
                                                        
            
            
              
                
[愿得一人], 2021-01-13 02:01

Try this:

res = _mm256_add_ps(vec1, vec2);

because the prototype of _mm256_add_ps is

__m256 _mm256_add_ps(__m256, __m256);

It takes two __m256 values as parameters and returns their element-wise sum as a __m256, just like

int add(int, int);

To initialize a vector, use

vec = _mm256_setr_ps(7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0);  or

vec = _mm256_load_ps(arr);  or

vec = _mm256_load_ps(ptr);

Note that _mm256_load_ps requires a 32-byte-aligned address; use _mm256_loadu_ps if the data may be unaligned.
                                                                        
                                                        
            
            
              
                