Cython parallelisation race condition for DFS

泪湿孤枕 submitted on 2019-12-14 01:55:55

Question


I'm attempting to develop an AI to play a 1-player board game optimally. I'm using a depth-first search to a few levels.

I've attempted to speed it up by multithreading the initial loop that iterates over all moves and recurses into the game trees. My idea is that the threads split the boards for the initial possible moves into chunks and evaluate each of them further in a separate recursive function. All functions called are nogil.

However, I'm encountering what I can only guess is a race condition because the multi-threaded solution gives different results, and I'm not sure how to go about fixing it.

cdef struct Move:
   int x
   int y
   int score

cdef Move search( board_t& board, int prevClears, int maxDepth, int depth ) nogil:
   cdef Move bestMove
   cdef Move recursiveMove
   cdef vector[ Move ] moves = generateMoves( board )
   cdef board_t nextBoard
   cdef int i, clears

   bestMove.score = 0

   # Split the initial possible move boards amongst threads
   for i in prange( <int> moves.size(), nogil = True ):
      # Applies move and calculates the move score
      nextBoard = applyMove( board, moves[ i ], prevClears, maxDepth, depth )

      # Recursively evaluate further moves
      if maxDepth - depth > 0:
         clears = countClears( nextBoard )
         recursiveMove = recursiveSearch( nextBoard, moves[ i ], clears, maxDepth, depth + 1 )
         moves[ i ].score += recursiveMove.score

      # Update bestMove
      if moves[ i ].score > bestMove.score:
         bestMove = moves[ i ]

   return bestMove

Answer 1:


Cython does some magic when prange is involved, and that magic depends on subtle details, so one really has to look at the resulting C code to understand what is going on.

As far as I can see, your code has at least two problems.

1. Problem: bestMove isn't initialized.

%%cython -+
cdef struct Move:
   ...

def foo():
   cdef Move bestMove
   return bestMove

would result in the following C-code:

...
struct __pyx_t_XXX_Move __pyx_v_bestMove;
...
__pyx_r = __pyx_convert__to_py_struct____pyx_t_XXX_Move(__pyx_v_bestMove); if ...
return __pyx_r;

The local variable __pyx_v_bestMove stays uninitialized (see e.g. this SO-post), even though it is quite possible that its initial value happens to consist only of zeros.

Were bestMove for example an int, Cython would give a warning, but it doesn't for structs.
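
A minimal C++ sketch of what "initialized" could look like for such a struct, assuming the search is eventually written on the C/C++ side. The helper name make_initial_move and the -1 sentinel for x and y are assumptions for illustration and are not part of the original code:

```cpp
// Plain aggregate mirroring the Move struct from the question.
struct Move {
    int x;
    int y;
    int score;
};

// Hypothetical helper (not from the original post): returns a Move whose
// fields all have defined values, so no indeterminate x/y can leak out.
Move make_initial_move() {
    Move m{};   // value-initialization: x, y and score are all set to 0
    m.x = -1;   // assumed sentinel meaning "no move chosen yet"
    m.y = -1;
    return m;
}
```

In Cython itself the equivalent is simply to assign every field of bestMove (not only score) before the loop.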

2. Problem: assigning bestMove leads to a race condition.

Btw, the result might not only fail to be the best move, it might even be an illegal move altogether: since the struct assignment is not atomic, the result could be a combination of x, y and score values taken from different legal moves.

Here is a smaller reproducer of the issue:

%%cython -c=-fopenmp --link-args=-fopenmp
# cython
cimport cython
from cython.parallel import prange

cdef struct A:
    double a

@cython.boundscheck(False)
def search_max(double[::1] vals):
    cdef A max_val = [-1.0] # initialized!
    cdef int i
    cdef int n = len(vals)
    for i in prange(n, nogil=True):
        if(vals[i]>max_val.a):
            max_val.a = vals[i]
    return max_val.a

Were max_val a cdef double, Cython wouldn't build this code, as it would try to make max_val private (subtle magic). But now max_val is shared between threads (see the resulting C code) and access to it should be guarded. If it is not, we can see the following result (one might need to run it multiple times to trigger the race condition):

>>> import numpy as np
>>> a = np.random.rand(1000)
>>> search_max(a)-search_max(a)
#0.0006253360398751351 but should be 0.0

What can be done? As @DavidW has proposed, we could collect the maximum per thread and then find the overall maximum in a post-processing step (see this SO-post), which leads to:

%%cython -+ -c=-fopenmp --link-args=-fopenmp

cimport cython
from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp

cdef struct A:
    double a

@cython.boundscheck(False)
def search_max(double[::1] vals):
    cdef int i, tid
    cdef int n = len(vals)
    cdef vector[A] max_vals
    # every thread gets its own max value:
    NUM_THREADS = 4
    max_vals.resize(NUM_THREADS, [-1.0])
    for i in prange(n, nogil=True, num_threads = NUM_THREADS):
        tid = threadid()
        if(vals[i]>max_vals[tid].a):
            max_vals[tid].a = vals[i]

    #post process, collect results of threads:
    cdef double res = -1.0
    for i in range(NUM_THREADS):
        if max_vals[i].a>res:
            res = max_vals[i].a

    return res

I think it is easier and less error-prone to use OpenMP functionality from C/C++ and wrap the resulting code with Cython: not only does Cython not support everything OpenMP offers, but spotting problems in parallel code is hard enough when looking at plain C code, without any implicit magic done by Cython.
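
As a rough sketch of that approach, here is how the "best move" selection could look in plain C++ with OpenMP: each thread keeps a private best move and merges it into the shared result inside a critical section. The function name best_move and the (-1, -1, 0) starting value are assumptions for illustration. The scores are taken as already computed, whereas in the question they come from the recursive search:

```cpp
#include <vector>

// Same layout as the Move struct from the question.
struct Move {
    int x;
    int y;
    int score;
};

// Hypothetical function (not from the original post): returns the move with
// the highest score, or the initial sentinel if no move scores above 0.
Move best_move(const std::vector<Move>& moves) {
    const Move initial{-1, -1, 0};  // fully initialized starting value
    Move best = initial;            // shared result, only written under "critical"
    const int n = static_cast<int>(moves.size());

    #pragma omp parallel
    {
        Move local = initial;       // private to each thread, so no race here

        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            if (moves[i].score > local.score) {
                local = moves[i];   // whole-struct copy into thread-private data
            }
        }

        // Merge step: one thread at a time compares and copies the struct,
        // so the shared result can never mix fields of different moves.
        #pragma omp critical
        {
            if (local.score > best.score) {
                best = local;
            }
        }
    }
    return best;
}
```

Compiled with -fopenmp the loop is split across threads. Without that flag the pragmas are ignored and the same code runs serially. If several moves share the top score, which of them is returned can depend on the order in which threads reach the merge step.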



Source: https://stackoverflow.com/questions/59005318/cython-parallelisation-race-condition-for-dfs