Atomic test-and-set in x86: inline asm or compiler-generated lock bts?

Compiling the code below for a Xeon Phi fails with: Error: cmovc is not supported on k1om.

It compiles without problems for a regular Xeon processor.

#include<stdio.h>
int main()
{
    int in=5;
    int bit=1;
    int x=0, y=1;
    int& inRef = in;
    printf("in=%d\n",in);
    asm("lock bts %2,%0\ncmovc %3,%1" : "+m" (inRef), "+r"(y) : "r" (bit), "r"(x));
    printf("in=%d\n",in);
}

Compiler - icc (ICC) 13.1.0 20130121

Peter Cordes

IIRC, first-gen Xeon Phi is based on P5 cores (Pentium, and Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro). So I think this is normal.

Just let the compiler do its job by writing a normal ternary operator.

Beyond that, cmov is a far worse choice for this than setc, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.

Also note that bts with a memory operand and a register bit-index is super-slow, so you don't want the compiler to generate that code anyway, esp. on a CPU that decodes x86 instructions into uops (like a modern Xeon). According to http://agner.org/optimize/, bts m, r is much slower than bts m, i even on P5, so don't do that.

Just ask the compiler for in to be in a register, or better yet, just don't use inline asm for this.

Since the OP apparently wants this to work atomically, the best solution is to use C++11's std::atomic::fetch_or, and leave it up to the compiler to generate lock bts.

std::atomic_flag has a test_and_set function, but IDK if there's a way to pack them tightly. Maybe as bitfields in a struct? Unlikely though. I also don't see atomic operations for std::bitset.
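One way to get tightly packed flags without std::atomic_flag is to keep the bits in an array of atomic words and use fetch_or on the word that holds the bit. The following is only a minimal sketch under that assumption; PackedAtomicBits and its member names are made up for illustration and are not part of the standard library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a standard type): N flags packed 32 per atomic word.
template <std::size_t N>
class PackedAtomicBits {
    static_assert(N > 0, "need at least one bit");
    static constexpr std::size_t kWords = (N + 31) / 32;
    std::atomic<std::uint32_t> words_[kWords];

public:
    PackedAtomicBits() {
        // Start with every flag cleared.
        for (auto &w : words_) w.store(0, std::memory_order_relaxed);
    }

    // Atomically sets bit i and returns its previous value.
    bool test_and_set(std::size_t i,
                      std::memory_order order = std::memory_order_seq_cst) {
        assert(i < N);
        const std::uint32_t mask = std::uint32_t{1} << (i % 32);
        return (words_[i / 32].fetch_or(mask, order) & mask) != 0;
    }

    // Reads bit i without modifying it.
    bool test(std::size_t i,
              std::memory_order order = std::memory_order_seq_cst) const {
        assert(i < N);
        const std::uint32_t mask = std::uint32_t{1} << (i % 32);
        return (words_[i / 32].load(order) & mask) != 0;
    }
};
```

With a compile-time-constant index the mask is an immediate, so the compiler is free to pick whatever locked instruction it likes for the fetch_or. Flags that share a word also share a cache line, so heavily contended flags may be better off spread out.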

Unfortunately, current versions of gcc and clang don't generate lock bts from fetch_or, even when the much-faster immediate-operand form is usable. I came up with the following (godbolt link):

#include <atomic>
#include <stdint.h>  // uint8_t
#include <stdio.h>

// wastes instructions when the return value isn't used.
// gcc 6.0 has syntax for using flags as output operands

// IDK if lock BTS is better than lock cmpxchg.
// However, gcc doesn't use lock BTS even with -Os
int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
  int retval = 0;  // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
  // Letting the compiler do the xor means we can use a m constraint, in case this is inlined where we're storing to already zeroed memory
  // It unfortunately doesn't help for overwriting a value that's already known to be 0 or 1.
  asm( // "xor      %[rv], %[rv]\n\t"
       "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"  // hope that the compiler zeroed with xor to avoid a partial-register stall
        : [x] "+m" (*x), [rv] "+rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}

// save an insn when retval isn't used, but still doesn't avoid the setc
// leads to the less-efficient setc/ movzbl sequence when the result is needed :/
int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
  uint8_t retval;
  asm( "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"
        : [x] "+m" (*x), [rv] "=rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}


int atomic_bts(std::atomic<unsigned> *x, unsigned int bit) {
  // bit &= 31; // stops gcc from using shlx?
  unsigned bitmask = 1u << bit;  // 1u: shifting a signed 1 into bit 31 would overflow
  //unsigned oldval = x->fetch_or(bitmask, std::memory_order_relaxed);

  unsigned oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
  // acquire and release semantics are free on x86
  // Also, any atomic rmw needs a lock prefix, which is a full memory barrier (seq_cst) anyway.

  if (oldval & bitmask)
    return 1;
  else
    return 0;
}

As discussed in "What is the best way to set a register to zero in x86 assembly: xor, mov or and?", xor / set-flags / setc is the optimal sequence for all modern CPUs when the result is needed as a 0-or-1 value. I haven't actually considered P5 for that, but setcc is fast on P5 so it should be fine.

Of course, if you want to branch on this instead of storing it, the boundary between inline asm and C is an obstacle. Spending two instructions to store a 0 or 1, only to test/branch on it, would be pretty dumb.

gcc6's flag-operand syntax would certainly be worth looking into, if it's an option. (Probably not if you need a compiler that targets Intel MIC.)
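For reference, here is a rough sketch of what the flag-output version could look like. It assumes a compiler that defines __GCC_ASM_FLAG_OUTPUTS__ on an x86 target and falls back to fetch_or everywhere else. The name atomic_bts_flag is made up for this example, and I haven't checked the generated code on every compiler.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical name. Returns true if the bit was already set.
// Precondition: bit < 32.
inline bool atomic_bts_flag(std::atomic<unsigned> *x, unsigned bit) {
    // With a register bit-index, bts on memory would address past *x for bit >= 32.
    assert(bit < 32);
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool was_set;
    // "=@ccc" asks for the carry flag as an output, so no setc is written here.
    asm volatile("lock btsl %[bit], %[x]"
                 : [x] "+m"(*x), [cf] "=@ccc"(was_set)
                 : [bit] "Ir"(bit)
                 : "memory");
    return was_set;
#else
    // Portable fallback: let the compiler choose the instruction sequence.
    const unsigned mask = 1u << bit;
    return (x->fetch_or(mask, std::memory_order_seq_cst) & mask) != 0;
#endif
}
```

When this inlines into something like if (atomic_bts_flag(&word, 3)), the compiler can branch on CF directly instead of materializing a 0 or 1 and testing it. When the result is needed as an integer, it still has to emit its own setc.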

Source: https://stackoverflow.com/questions/34940356/atomic-test-and-set-in-x86-inline-asm-or-compiler-generated-lock-bts

Tags

assembly

x86

icc

xeon-phi