Fast and Lock Free Single Writer, Multiple Reader

Question

I've got a single writer which has to increment a variable at a fairly high frequency, and one or more readers which access this variable at a lower frequency.

The write is triggered by an external interrupt.

Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.

The approach I came up with is to copy the value after writing it. The reader can then compare the original with the copy. If they are equal, the variable's content is valid.

Here is my implementation in C++:

template<typename T>
class SafeValue
{
private:
    volatile T _value;
    volatile T _valueCheck;
public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do
        {
            valueCheck = _valueCheck;
            value = _value;
        } while(value != valueCheck);

        return value;
    }
};

The idea behind this is to detect data races while reading and to retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, hence my question:

Is there any problem with my approach when used with a single writer and multiple readers?

I already know that high write frequencies may cause starvation of the reader. Are there other bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?

Edit 1:

My target system is an ARM Cortex-A15.

T should be able to be at least any primitive integral type.

Edit 2:

std::atomic is too slow on both the reader and the writer side. I benchmarked it on my system. Writes are roughly 30 times slower and reads roughly 50 times slower than unprotected, primitive operations.

Answer 1:

If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.

Answer 2:

You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting a Cortex-A15 (an ARMv7-A CPU), make sure to use -march=armv7-a or even -mcpu=cortex-a15.

The first should generate an ldrexd instruction, which is atomic according to the ARM documentation:

Single-copy atomicity

In ARMv7, the single-copy atomic processor accesses are:

all byte accesses

all halfword accesses to halfword-aligned locations

all word accesses to word-aligned locations

memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.

The latter should generate an ldrd instruction, which is atomic on targets that support the Large Physical Address Extension:

In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.

--- Note ---

The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.

You can also check how the Linux kernel implements these:

#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrd    %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrexd  %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#endif
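Before benchmarking, it may be worth confirming that std::atomic is actually lock-free for the types in question with the chosen -march/-mcpu flags. The following is only a sketch, assuming a C++11 or later toolchain; the helper name atomic_is_lock_free is made up for illustration and is not part of any library.

```cpp
#include <atomic>
#include <cstdint>

// ATOMIC_INT_LOCK_FREE comes from <atomic>: 2 means std::atomic<int> is
// always lock-free, 1 means sometimes, 0 means never.
static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "std::atomic<int> is not always lock-free on this target");

// Hypothetical helper (illustration only): runtime check for an arbitrary T.
// Useful for 64-bit types, whose lock-freedom can depend on the target flags.
template <typename T>
bool atomic_is_lock_free() {
    std::atomic<T> probe{T{}};
    return probe.is_lock_free();
}
```

Calling atomic_is_lock_free<long long>() on the target would show whether 64-bit accesses fall back to a lock. On some toolchains that call may need the program to be linked against libatomic.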

Answer 3:

There's no way anyone can know. You would have to either check whether your compiler documents any multi-threaded semantics that guarantee this will work, or look at the generated assembly code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.

I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.
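As a rough sketch of what "the appropriate memory_order" could look like for a single-writer counter: the class name RelaxedCounter is hypothetical, and the sketch assumes the readers only need an untorn value and no ordering with respect to other data. Default (sequentially consistent) operations add barriers on ARMv7, which may account for part of the measured slowdown, so relaxed operations are worth benchmarking separately.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical name; not taken from the question or the answers.
class RelaxedCounter {
    std::atomic<std::uint32_t> value_{0};

public:
    // Single writer only. A relaxed load followed by a relaxed store avoids
    // the read-modify-write that fetch_add needs (an ldrex/strex retry loop
    // on ARMv7). This is only correct because no other thread writes.
    void increment() {
        std::uint32_t v = value_.load(std::memory_order_relaxed);
        value_.store(v + 1, std::memory_order_relaxed);
    }

    // Any number of readers. The load is never torn, but it implies no
    // ordering with respect to other memory locations.
    std::uint32_t get() const {
        return value_.load(std::memory_order_relaxed);
    }
};
```

The interrupt-driven writer would call increment() and the readers would call get(). If a reader must also see other data published by the writer, release/acquire ordering would be needed instead.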

Answer 4:

Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.

#include <atomic>
#include <utility>

template<class T>
class PublisherValue {
    static auto constexpr N = 32;
    T values_[N];
    std::atomic<T*> current_{values_};

public:
    PublisherValue() = default;
    PublisherValue(PublisherValue const&) = delete;
    PublisherValue& operator=(PublisherValue const&) = delete;

    // Single writer thread only.
    template<class U>
    void store(U&& value) {
        T* p = current_.load(std::memory_order_relaxed);
        if(++p == values_ + N)
            p = values_;
        *p = std::forward<U>(value);
        current_.store(p, std::memory_order_release); // (1) 
    }

    // Multiple readers. Make a copy to avoid referring to the value for too long.
    T load() const {
        return *current_.load(std::memory_order_consume); // Sync with (1).
    }
};

This is wait-free, but there is a small chance that a reader might be de-scheduled while copying the value and hence read the oldest value while it has been partially overwritten. Making N bigger reduces this risk.
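To put a number on that risk, the following single-threaded sketch mimics the writer's pointer arithmetic and counts how many stores happen before the writer returns to the slot a stalled reader is still looking at. The function name stores_until_slot_reuse is made up for illustration; it is a simulation of the ring indexing only, not a concurrency test.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): simulates the ring of N slots and
// returns the number of stores performed up to and including the one that
// overwrites the slot a stalled reader obtained before the first store.
template <std::size_t N>
std::size_t stores_until_slot_reuse() {
    int values[N] = {};
    int* current = values;       // what the atomic pointer would hold
    int* const held = current;   // slot the stalled reader is still copying
    std::size_t stores = 0;
    do {
        if (++current == values + N)  // same wrap-around as in store()
            current = values;
        *current = static_cast<int>(++stores);
    } while (current != held);
    return stores;
}
```

With N = 32 the function returns 32: a reader has to finish its copy before the writer completes 31 further stores and starts the 32nd, which is why a larger N widens the safe window.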

Source: https://stackoverflow.com/questions/54125968/fast-and-lock-free-single-writer-multiple-reader

Tags

c++

multithreading

lock-free