Reading interlocked variables

Asked by 鱼传尺愫 on 2020-12-24 02:23

Assume:

A. C++ under WIN32.

B. A properly aligned volatile integer incremented and decremented using InterlockedIncrement() and InterlockedDecrement().

C. The goal is to simply read the current value of _ServerState.

10 Answers
  • 2020-12-24 02:47

    you should be okay. It's volatile, so the optimizer shouldn't savage you, and it's a 32-bit value, so it should be at least approximately atomic. The one possible surprise is whether the processor's instruction pipeline can reorder things around that.

    On the other hand, what's the additional cost of using the guarded routines?

  • 2020-12-24 02:47

    Simply reading the current value may not need any lock at all.

  • 2020-12-24 02:47

    The Interlocked* functions make read-modify-write operations on a memory location atomic, even when two different processors touch it at the same time. In a single-processor system you are going to be OK. If you have a dual-core system where threads on different cores both update this value, you might have problems doing what you think is atomic without the Interlocked* functions.

  • 2020-12-24 02:54

    It depends on what you mean by "goal is to simply read the current value of _ServerState" and it depends on what set of tools and the platform you use (you specify Win32 and C++, but not which C++ compiler, and that may matter).

    If you simply want to read the value such that the value is uncorrupted (i.e., if some other processor is changing the value from 0x12345678 to 0x87654321, your read will get one of those 2 values and not 0x12344321), then simply reading will be OK as long as the variable is:

    • marked volatile,
    • properly aligned, and
    • read using a single instruction with a word size that the processor handles atomically

    None of this is promised by the C/C++ standard, but Windows and MSVC do make these guarantees, and I think that most compilers that target Win32 do as well.
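    For completeness, modern C++ does promise this portably: a std::atomic load is guaranteed tear-free on every conforming platform. A minimal sketch (assuming C++11 or later is available; the name serverState is illustrative, not from the question):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the question's _ServerState variable.
std::atomic<std::uint32_t> serverState{0};

// An atomic load can never return a torn value, on any compiler or
// processor - unlike a plain volatile read, which relies on the
// platform-specific guarantees described above.
std::uint32_t ReadServerState() {
    return serverState.load(std::memory_order_relaxed);
}
```

    memory_order_relaxed matches the "just read an uncorrupted value" case; use acquire if the read must also synchronize with writes to other data.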

    However, if you want your read to be synchronized with behavior of the other thread, there's some additional complexity. Say that you have a simple 'mailbox' protocol:

    #include <stdint.h>
    
    struct mailbox_struct {
        uint32_t flag;
        uint32_t data;
    };
    typedef struct mailbox_struct volatile mailbox;
    
    
    // the global - initialized before either thread starts
    
    mailbox mbox = { 0, 0 };
    
    //***************************
    // Thread A
    
    while (mbox.flag == 0) { 
        /* spin... */ 
    }
    
    uint32_t data = mbox.data;
    
    //***************************
    
    //***************************
    // Thread B
    
    mbox.data = some_very_important_value;
    mbox.flag = 1;
    
    //***************************
    

    The thinking is that Thread A will spin waiting for mbox.flag to indicate that mbox.data holds a valid piece of information. Thread B will write some data into mbox.data, then set mbox.flag to 1 as a signal that mbox.data is valid.

    In this case a simple read in Thread A of mbox.flag might get the value 1 even though a subsequent read of mbox.data in Thread A does not get the value written by Thread B.

    This is because even though the compiler will not reorder the Thread B writes to mbox.data and mbox.flag, the processor and/or cache might. C/C++ guarantees that the compiler will generate code such that Thread B will write to mbox.data before it writes to mbox.flag, but the processor and cache might have a different idea - special handling called 'memory barriers' or 'acquire and release semantics' must be used to ensure ordering below the level of the thread's stream of instructions.

    I'm not sure if compilers other than MSVC make any claims about ordering below the instruction level. However MS does guarantee that for MSVC volatile is enough - MS specifies that volatile writes have release semantics and volatile reads have acquire semantics - though I'm not sure at which version of MSVC this applies - see http://msdn.microsoft.com/en-us/library/12a04hfd.aspx?ppud=4.
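    As a portable way to get exactly this acquire/release pairing without relying on MSVC's volatile semantics, the mailbox above can be sketched with C++11 atomics (an illustrative rewrite, not the original code):

```cpp
#include <atomic>
#include <cstdint>

struct mailbox {
    std::atomic<std::uint32_t> flag{0};
    std::uint32_t data{0};
};

mailbox mbox;

// Thread B: the release store guarantees the write to mbox.data is
// visible before any thread can observe flag == 1.
void producer(std::uint32_t some_very_important_value) {
    mbox.data = some_very_important_value;
    mbox.flag.store(1, std::memory_order_release);
}

// Thread A: the acquire load guarantees that once flag == 1 is seen,
// the matching write to mbox.data is visible too.
std::uint32_t consumer() {
    while (mbox.flag.load(std::memory_order_acquire) == 0) {
        /* spin... */
    }
    return mbox.data;
}
```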

    I have also seen code like you describe that uses Interlocked APIs to perform simple reads and writes to shared locations. My take on the matter is to use the Interlocked APIs. Lock-free inter-thread communication is full of subtle, difficult-to-understand pitfalls, and taking a shortcut on a critical bit of code that may end up with a very-difficult-to-diagnose bug doesn't seem like a good idea to me. Also, using an Interlocked API screams to anyone maintaining the code, "this is data access that needs to be shared or synchronized with something else - tread carefully!".

    Also when using the Interlocked API you're taking the specifics of the hardware and the compiler out of the picture - the platform makes sure all of that stuff is dealt with properly - no more wondering...

    Read Herb Sutter's Effective Concurrency articles on DDJ (which happen to be down at the moment, for me at least) for good information on this topic.

  • 2020-12-24 02:58

    To anyone who has to revisit this thread, I want to add to what was well explained by Bartosz that _InterlockedCompareExchange() is a good alternative to the standard atomic_load() if standard atomics are not available. Here is the code for atomically reading my_uint32_t_var in C on x86 Win64. atomic_load() is included as a benchmark:

     long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
    00000001401A6955  mov         eax,dword ptr [rbp+30h] 
    00000001401A6958  xor         edi,edi 
    00000001401A695A  mov         dword ptr [rbp-0Ch],eax 
        debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
    00000001401A695D  xor         eax,eax 
    00000001401A695F  lock cmpxchg dword ptr [rbp+30h],edi 
    00000001401A6964  mov         dword ptr [rbp-0Ch],eax 
        debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
    00000001401A6967  prefetchw   [rbp+30h] 
    00000001401A696B  mov         eax,dword ptr [rbp+30h] 
    00000001401A696E  xchg        ax,ax 
    00000001401A6970  mov         ecx,eax 
    00000001401A6972  lock cmpxchg dword ptr [rbp+30h],ecx 
    00000001401A6977  jne         foo+30h (01401A6970h) 
    00000001401A6979  mov         dword ptr [rbp-0Ch],eax 
    
        long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
    00000001401A6955  mov         eax,dword ptr [rbp+30h] 
        release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
    00000001401A6958  mov         dword ptr [rbp-0Ch],eax 
    00000001401A695B  xor         edi,edi 
    00000001401A695D  mov         eax,dword ptr [rbp-0Ch] 
    00000001401A6960  xor         eax,eax 
    00000001401A6962  lock cmpxchg dword ptr [rbp+30h],edi 
    00000001401A6967  mov         dword ptr [rbp-0Ch],eax 
        release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
    00000001401A696A  prefetchw   [rbp+30h] 
    00000001401A696E  mov         eax,dword ptr [rbp+30h] 
    00000001401A6971  mov         ecx,eax 
    00000001401A6973  lock cmpxchg dword ptr [rbp+30h],ecx 
    00000001401A6978  jne         foo+31h (01401A6971h) 
    00000001401A697A  mov         dword ptr [rbp-0Ch],eax
    
  • 2020-12-24 02:59

    Your initial understanding is basically correct. According to the memory model which Windows requires on all MP platforms it supports (or ever will support), reads from a naturally aligned variable marked volatile are atomic as long as they are no larger than a machine word. The same goes for writes. You don't need a 'lock' prefix.

    If you do the reads without using an interlock, you are subject to processor reordering. This can even occur on x86, in a limited circumstance: reads from a variable may be moved above writes of a different variable. On pretty much every non-x86 architecture that Windows supports, you are subject to even more complicated reordering if you don't use explicit interlocks.
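    The x86 case above is the classic store-load reordering: each thread writes one variable and then reads the other, and without a full barrier both reads can be hoisted above the writes. A sketch of the pattern (the variables x and y are hypothetical; C++11 seq_cst operations serve as the needed barrier here):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

// Thread 1:  x = 1;  r1 = y;        Thread 2:  y = 1;  r2 = x;
//
// With plain stores and loads, even x86 allows r1 == 0 && r2 == 0,
// because each thread's read of the other variable can be moved above
// its own write. seq_cst (or an explicit full fence) rules that out.

int thread1() {
    x.store(1, std::memory_order_seq_cst);
    return y.load(std::memory_order_seq_cst);
}

int thread2() {
    y.store(1, std::memory_order_seq_cst);
    return x.load(std::memory_order_seq_cst);
}
```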

    There's also a requirement that if you're using a compare exchange loop, you must mark the variable you're compare exchanging on as volatile. Here's a code example to demonstrate why:

    long g_var = 0;  // not marked 'volatile' -- this is an error
    
    bool foo () {
        long oldValue;
        long newValue;
        long retValue;
    
        // (1) Capture the original global value
        oldValue = g_var;
    
        // (2) Compute a new value based on the old value
        newValue = SomeTransformation(oldValue);
    
    // (3) Store the new value only if the global value still equals the old one
        retValue = InterlockedCompareExchange(&g_var,
                                              newValue,
                                              oldValue);
    
        if (retValue == oldValue) {
            return true;
        }
    
        return false;
    }
    

    What can go wrong is that the compiler is well within its rights to re-fetch oldValue from g_var at any time if it's not volatile. This 'rematerialization' optimization is great in many cases because it can avoid spilling registers to the stack when register pressure is high.

    Thus, step (3) of the function would become:

    // (3) Incorrectly store new value regardless of whether the global
    //     is equal to old.
    retValue = InterlockedCompareExchange(&g_var,
                                          newValue,
                                          g_var);
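    One way to make the intent explicit without relying on volatile at all (a sketch assuming C++11 is available; SomeTransformation is stood in by a trivial increment) is to hold the shared value in a std::atomic, which the compiler is never allowed to silently re-fetch:

```cpp
#include <atomic>

std::atomic<long> g_var{0};  // atomic instead of volatile: no silent re-fetch

long SomeTransformation(long v) { return v + 1; }  // hypothetical transform

bool foo() {
    // (1) Capture the original global value exactly once.
    long oldValue = g_var.load();

    // (2) Compute a new value based on the old value.
    long newValue = SomeTransformation(oldValue);

    // (3) Publish only if the global still equals oldValue; this plays
    //     the role of InterlockedCompareExchange in the original.
    return g_var.compare_exchange_strong(oldValue, newValue);
}
```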
    