How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
Question

In the general case, how can an instruction that accepts either memory or register operands ever be slower with a memory operand than the equivalent mov + mov -> instruction -> mov + mov sequence?

Based on the throughput and latency figures in Agner Fog's instruction tables (Skylake in my case, p. 238), I see the following numbers for the btr/bts instructions:

instruction  operands  uops (fused domain)  uops (unfused domain)  latency  throughput
mov          r,r       1                    1                      0-1      .25
mov          m,r       1                    2                      2        1
mov          r,m       1                    1                      2        .5
...
bts/btr      r,r       1                    1