OpenCL global memory fetches

予麋鹿 2021-01-03 06:10

I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is that a lot of the global memory accesses are not coalesced, and the fetches are really bringing down performance.

3 Answers
  • 2021-01-03 06:30

    I am not able to understand your question fully, but if you have large global memory accesses and that data is re-used, stage it in local memory, as sketched below.

    Note: with a small local work size, little data is shared, so local memory gains you nothing; with a large local work size, you get fewer parallel work-groups. You need to pick the best trade-off.
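
    For illustration, here is a minimal sketch of that staging pattern (the kernel name, the 256-element tile, and the neighbour-averaging re-use are assumptions for the example, not from the question):

    // Hypothetical sketch: each work-item loads one value from global
    // memory into local memory; afterwards every work-item in the group
    // can re-use its neighbours' values without extra global fetches.
    __kernel void stageInLocal(__global const float * in,
                               __global       float * out)
    {
      __local float tile[256];          // assumes local work size <= 256
      int lid = get_local_id(0);
      int gid = get_global_id(0);

      tile[lid] = in[gid];              // one coalesced global read each
      barrier(CLK_LOCAL_MEM_FENCE);     // wait until the tile is filled

      // re-use the staged data: a simple neighbour average
      int left  = max(lid - 1, 0);
      int right = min(lid + 1, (int)get_local_size(0) - 1);
      out[gid] = (tile[left] + tile[lid] + tile[right]) / 3.0f;
    }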

  • 2021-01-03 06:31

    In general, fewer fetches of larger size will be more efficient. I can't give you specific advice without seeing your code, but make sure work-items access sequential chunks to enable 'streaming'. Do any transpositions or random memory accesses after you bring the data into local memory, as in the sketch below.
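
    As a concrete illustration of that advice, here is a minimal sketch (the kernel name and the 16x16 tile size are assumptions): work-items read a row-major matrix with sequential, coalesced accesses, and the transposed access happens only in local memory.

    // Hypothetical sketch: coalesced global reads, transpose in local memory.
    // Launch with a 16x16 local work size; width/height multiples of 16.
    __kernel void transposeTile(__global const float * in,
                                __global       float * out,
                                const int width,
                                const int height)
    {
      __local float tile[16][16];
      int lx = get_local_id(0), ly = get_local_id(1);
      int gx = get_global_id(0), gy = get_global_id(1);

      // sequential reads: adjacent work-items read adjacent addresses
      tile[ly][lx] = in[gy * width + gx];
      barrier(CLK_LOCAL_MEM_FENCE);

      // the transposed ("random") access is confined to fast local memory;
      // the global write below is again sequential within the tile
      int ox = get_group_id(1) * 16 + lx;
      int oy = get_group_id(0) * 16 + ly;
      out[oy * height + ox] = tile[lx][ly];
    }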

  • 2021-01-03 06:49

    You can use clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE to find out what the cacheline size is for a device. On many devices today, this value is typically 16 bytes.
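
    For reference, a minimal host-side sketch of that query (error handling trimmed; `device` is assumed to be a valid cl_device_id you obtained earlier):

    #include <stdio.h>
    #include <CL/cl.h>

    // Query the global memory cacheline size of an already-selected device.
    void printCachelineSize(cl_device_id device)
    {
      cl_uint cachelineSize = 0;
      clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE,
                      sizeof(cachelineSize), &cachelineSize, NULL);
      printf("Global memory cacheline: %u bytes\n", cachelineSize);
    }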

    Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your 'small chunks' close together in memory to keep access fast.

    The two kernels below, vectorAddFoo and vectorAddBar, demonstrate two ways to access the memory; the third function, copySomeMemory(...), applies to your question specifically. Both vector kernels have their work items add a portion of the vectors being added, but use different memory access patterns. vectorAddFoo has each work item process a block of vector elements, starting at its calculated position in the arrays and moving forward through its workload. vectorAddBar has work items start at their gid and skip gSize (= global size) elements before fetching and adding the next elements.

    vectorAddBar will execute faster because its reads and writes fall into the same cacheline in memory. Every 4 float reads will fall on the same cacheline, and take only one action from the memory controller to perform. After reading a[] and b[] in this manner, all four work items will be able to do their addition, and queue their write to c[].

    vectorAddFoo guarantees the reads and writes are not in the same cacheline (except for very short vectors, ~totalElements<5). Every read from a work item will require an action from the memory controller. Unless the GPU caches the following 3 floats in every case, this will result in 4x the memory accesses.

    __kernel void
    vectorAddFoo(__global const float * a,
                 __global const float * b,
                 __global       float * c,
                 const int totalElements)
    {
      int gid = get_global_id(0);
      int elementsPerWorkItem = totalElements/get_global_size(0);
      int start = elementsPerWorkItem * gid;

      // each work-item walks its own contiguous block, so neighbouring
      // work-items read far-apart addresses in the same iteration
      for(int i=0;i<elementsPerWorkItem;i++){
        c[start+i] = a[start+i] + b[start+i];
      }
    }
    __kernel void
    vectorAddBar(__global const float * a,
                 __global const float * b,
                 __global       float * c,
                 const int totalElements)
    {
      int gid = get_global_id(0);
      int gSize = get_global_size(0);

      // neighbouring work-items read neighbouring addresses each
      // iteration, so the reads coalesce into shared cachelines
      for(int i=gid;i<totalElements;i+=gSize){
        c[i] = a[i] + b[i];
      }
    }
    __kernel void
    copySomeMemory(__global const int * src,
                   const int count,
                   const int position)
    {
      // copy 16kb of integers to local memory, starting at 'position';
      // work-items load interleaved elements so the reads coalesce
      int lid = get_local_id(0);
      int lSize = get_local_size(0);
      __local int dst[4096];
      for(int i=lid;i<4096;i+=lSize){
        dst[i] = src[position+i];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      //use dst here...
    }
    