OpenMP offloaded target region executed in both host and target-device

I'm working on a project that requires OpenMP offloading to Nvidia GPUs using Clang. I was able to install Clang with offloading support by following the instructions mentioned here.

System specification

OS - Ubuntu 16.04 LTS
Clang version - 4.00
Processor - Intel(R) Core(TM) i7-4700MQ CPU
CUDA version - 9.0
Nvidia GPU - GeForce 740M (sm_capability - 35)

The problem is that when I execute a sample program to test OpenMP offloading to the Nvidia GPU, part of the target region runs on the GPU and then the same target region starts executing on the host.

The sample program is below. It is a small C program written to multiply two matrices.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Problem size. */
# define N 1920

void init_array(float* A, float* B)
{
    int i, j;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            A[i*N + j] = ((float) i*j) / N;
        }
    }

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            B[i*N + j] = ((float) i*(j+1)) / N;
        }
    }
}

void mm_kernel(float *A, float *B, float *E)
{
    #pragma omp target data map(to:A) map(to:B) map(alloc:E)
    {
        #pragma omp target
        #pragma omp teams distribute num_teams(4)
        for (int i = 0; i < N; i++)
        {
            printf("Team %d Thread %d Number of threads %d \n", omp_get_team_num(), omp_get_thread_num(), omp_get_num_threads());
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
            {
                E[i*N + j] = 0.0;
                for (int k = 0; k < N; k++)
                {
                    E[i*N + j] = E[i*N + j] + A[i*N + k] * B[j*N+k];
                }
            }
        }
    }
}

int main(){
    double t_start, t_end;

    float* A;
    float* B;
    float* E;

    A = (float*)malloc(N*N*sizeof(float));
    B = (float*)malloc(N*N*sizeof(float));
    E = (float*)malloc(N*N*sizeof(float));
    init_array(A, B); //initialize Matrix A and B

    t_start = omp_get_wtime();
    mm_kernel(A,B,E);
    t_end = omp_get_wtime();

    printf("Time spent %lf\n",t_end-t_start );
    free(A);
    free(B);
    free(E);
}

The program was compiled using

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda 3mm.c -o 3mmgpu

The main reason I claim that the target region is executing on both the host and the target device is the command-line output.

At first, team 0 and team 1 each report 960 threads per team, and later iterations report 2 threads per team. (My processor has 4 cores, each capable of handling 2 hardware threads.)

I also ran the fat binary under nvprof to verify whether anything is being executed on the GPU.

The profiling results are as follows.

I cannot understand what is happening in the target region. Why is the target region being executed on both the host and the target device?

I'm posting an answer to my own question, as I was finally able to figure out what went wrong in the code. The offloaded region crashed on the target device because I had mapped the data to the GPU incorrectly: I mapped only the pointers, without allocating memory for the arrays on the GPU. Because the GPU execution crashed, execution fell back to the host.
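For reference, here is a minimal sketch of what a corrected mapping could look like, assuming the fix is to map array sections (pointer plus length) rather than the bare pointers. The function name mm_kernel_mapped and the run-time size parameter n are hypothetical names introduced for this sketch. Using map(from:) for E instead of map(alloc:) is also my assumption, so that the result is copied back to the host when the data region ends.

```c
/* Sketch only: mm_kernel_mapped and n are hypothetical names, not from the
   original program. Computes E[i][j] = sum_k A[i][k] * B[j][k], the same
   access pattern as the kernel in the question. */
void mm_kernel_mapped(const float *A, const float *B, float *E, int n)
{
    /* An array section such as A[0:n*n] means "n*n elements starting at
       index 0", so device memory is allocated and the contents are copied.
       map(to:A) on a bare pointer does not transfer the array. */
    #pragma omp target data map(to: A[0:n*n], B[0:n*n]) map(from: E[0:n*n])
    {
        #pragma omp target teams distribute
        for (int i = 0; i < n; i++)
        {
            #pragma omp parallel for
            for (int j = 0; j < n; j++)
            {
                /* Accumulate in a local variable, then store once. */
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                {
                    sum += A[i*n + k] * B[j*n + k];
                }
                E[i*n + j] = sum;
            }
        }
    }
}
```

In the original program this would be called as mm_kernel_mapped(A, B, E, N). If the compiler is invoked without -fopenmp, the pragmas are ignored and the loops simply run serially on the host.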

Thank you @Alexey Bataev for pointing that out.
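For anyone chasing a similar symptom, a more direct check than reading thread counts might be to ask the runtime where a target region actually ran. The sketch below assumes the compiler's OpenMP runtime provides omp_is_initial_device() (part of the OpenMP API since 4.0); the function name target_ran_on_host is hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if the target region executed on the
   host, 0 if it executed on an offload device. */
int target_ran_on_host(void)
{
    int on_host = 1; /* default when built without OpenMP support */
#ifdef _OPENMP
    /* map(from:) copies the flag back to the host after the region. */
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
#endif
    return on_host;
}
```

Calling this before the real kernel and printing the result shows whether offloading works at all on the machine. It does not by itself reveal a crash and fallback inside a different target region.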

Source: https://stackoverflow.com/questions/49718730/openmp-offloaded-target-region-executed-in-both-host-and-target-device

Tags: cuda, clang, openmp, nvidia, offloading