Question
So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS 250, which is a compute capability 1.1 device. However, on a compute capability 2.0 device (C2050 or C2070) the hash table is corrupted (the data is incorrect and pointers are sometimes wrong).
Basically, the code works fine when only one block is used (on both devices). However, when two or more blocks are used, it works only on the GTS 250 and not on any Fermi device.
I understand that the warp scheduling and memory architecture differ between the two platforms, and I am taking that into account when developing the code. From my understanding, using __threadfence() should make sure any global writes are committed and visible to other blocks; however, judging by the corrupt hash table, it appears that they are not.
I've also posted the problem on the NVIDIA CUDA developer forum and it can be found here.
Relevant code below:
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);

    lock(&globalHashLocks[hashValue]);

    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;

    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }

    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);

        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;

        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }

    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}
Answer 1:
As pointed out by LSChien in this post, the issue is with L1 cache coherency. While __threadfence() will guarantee that shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may still read a stale cached value until thread y in block 0 has executed up to the __threadfence() instruction. Instead, LSChien suggested a hack in his post: use an atomicCAS() to force the thread to read from global memory instead of from a cached value. The proper way to do this is to declare the memory as volatile, requiring that every write to that memory be made visible to all other threads in the grid immediately.
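A minimal sketch of the read paths this answer contrasts, in CUDA C. The function names and the helper functions themselves are illustrative, not from the original code; only atomicCAS() and the volatile qualifier come from the answer:

__device__ int plain_read(int *p) {
    // Ordinary load: on Fermi this can be served from the SM's non-coherent L1
    // and return a stale value, even after the writing block called __threadfence().
    return *p;
}

__device__ int atomic_read(int *p) {
    // LSChien's hack: atomics are resolved in L2/global memory, so the returned
    // "old" value reflects the latest globally visible write. With compare == val,
    // the stored contents are never actually changed.
    return atomicCAS(p, 0, 0);
}

__device__ int volatile_read(int *p) {
    // The approach the answer recommends: a volatile-qualified access tells the
    // compiler the location may be modified by other threads, so the load is not
    // optimized away or satisfied from a stale cached copy.
    return *(volatile int *)p;
}

Applied to the code in the question, the reads that can go stale are presumably table.entries[hashValue] and table.pool[...] performed after lock() by a block running on a different SM than the last writer.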
Answer 2:
__threadfence() guarantees that writes to global memory are visible to other threads before it returns. That is not the same as saying the write to global memory is complete! Think of the caching on each multiprocessor.
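To make that distinction concrete, here is a hedged two-block producer/consumer sketch in CUDA C; the kernel and variable names are illustrative and not from the question. __threadfence() only orders the producer's two writes; the consumer still reads through volatile so its loads are not satisfied from a stale cached line. The spin loop assumes both blocks are resident on the device at the same time.

__device__ int payload = 0;
__device__ int ready   = 0;

__global__ void handshake(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Producer: write the payload, fence, then raise the flag. The fence
        // guarantees that no thread observing ready == 1 can afterwards
        // observe the old value of payload.
        payload = 42;
        __threadfence();
        ready = 1;
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Consumer: spin on the flag, then read the payload. The volatile casts
        // keep the loads from being hoisted out of the loop or served from a
        // stale cached copy.
        while (*(volatile int *)&ready == 0) { }
        *out = *(volatile int *)&payload;
    }
}

// Example launch: handshake<<<2, 1>>>(d_out);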
Source: https://stackoverflow.com/questions/5540217/cuda-block-synchronization-differences-between-gts-250-and-fermi-devices