Hamming weight ( number of 1 in a number) mixing C with assembly

后端未结

关注

 4  1377

I\'m trying to count how many number 1, are in the numbers of an array.

First I have a code in C lenguaje(work ok):

int popcount2(int* array, int le


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  时光说笑        
                
              
                            
                2020-12-21 11:47
              
            
            
                                                                       
The nicest think you can do is using built in popcount function as suggested by Paul R, but since you need to write it in assembly, this worked for me:

asm (
"start:                  \n"
        "and %0, %1      \n"
        "jz end          \n"
        "shr $0, %1      \n"
        "jnc start       \n"
        "inc %1          \n"
        "jmp start       \n"
"end:                    \n"
        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);


At first two lines you just check the contents of x (and go to end if it's zero Jump Zero). Than you shift x one bit to right and:


  At the end of the shift operation, the CF flag contains the last bit shifted out of the destinationoperand. *


If there's no CF set, just go to start (Jump Not Carry) else increment result and then go to start.

And the beautiful think about assembly is that you can do things in so many ways...

asm (
"start:                  \n"
        "shr $1, %1      \n"
        "jnc loop_cond   \n"
        "inc %0          \n"
        "and %1, %1      \n"
"loop_cond:              \n"
        "jnz start       \n"

        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);


Here you again use SHift Right instruction, if no CF is present just go to loop condition.

Otherwise again increment result and call binary AND (INC does modify ZF).



Using LOOP and ECX

I was curious how to do this in 3 instruction (I figured your teacher wouldn't give you bottom limit of 3 if it wasn't possible) and I realized x86 also has LOOP instruction:


  Each time the LOOP instruction is executed, the count register is decremented, then checked for 0. If the count is 0, the loop is terminated and program execution continues with the instruction following the LOOP instruction. If the count is not zero, a near jump is performed to the destination (target) operand, which is presumably the instruction at the beginning of the loop. *


And you can add input argument using GCCs input constrain:


  c - The c register. 


asm (
"start:              \n"
    "shr $1, %1      \n"
    "adc $0, %0      \n"
    "loop start      \n"

    : "+g" (result)
    : "r" (x),
      "c" (8)             // Assuming 8b type (char)
);


Just to make sure it compiles to proper assembly:

0x000000000040051f <+25>:   mov    $0x8,%ecx
0x0000000000400524 <+30>:   mov    -0x8(%rbp),%eax
0x0000000000400527 <+33>:   shr    %edx
0x0000000000400529 <+35>:   adc    $0x0,%eax
0x000000000040052c <+38>:   loop   0x400527 <main+33>


I think the first one should have a bit better performance, especially if there's just 1 bit set, this approach always does k*8 iterations.



SSE4 and single instruction

I know you have to use a loop, but just for fun... With SSE4 extension you could do this by just one instruction POPCNT:


  This instruction calculates of number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register). *


I think (I have a quite old CPU on my notebook, so I can't test this for you) you should be able to do this with just one simple instruction:

asm (   
    "POPCNT %1, %0   \n"
    : "=r" (result)
    : "mr" (x)
    : "cc"                                                                                                                                       
);


(If you try this and you do have SSE4 extension, please let me know if it works)



Performance

I've measured times required to do 100,000,000 popcounts comparing my first and second methods with David Wohlferd's. ^{[Raw data]} 

+--------------+------------+------------+------------+
|              | 0x00000000 | 0x80000001 | 0xffffffff |
+--------------+------------+------------+------------+
| 1st solution |  0.543     |  5.040     |  3.833     |
| LOOP         | 11.530     | 11.523     | 11.523     |
| Davids       |  0.750     |  4.893     |  4.890     |
+--------------+------------+------------+------------+


If anyone can compare these 3 with SSE4's POPCNT instruction I'd be glad.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2020-12-21 11:49
              
            
            
                                                                       

  assembly using 3-6 lines of code.


This example uses a 4 instruction loop:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        xor     eax,eax                 ;will be popcnt
        test    ecx,ecx                 ;br if ecx == 0
        jz      popc1
popc0:  lea     edx,[ecx-1]             ;edx = ecx-1
        inc     eax                     ;eax += 1
        and     ecx,edx                 ;ecx &= (ecx-1)
        jnz     short popc0
popc1:  ret
popcntx endp


This example uses a 3 instruction loop, but it would be slower than the 4 instruction loop version on most processors.

popcntx proc    near
        mov     eax,[esp+4]             ;eax = value to popcnt
        mov     ecx,32                  ;ecx = max # 1 bits
        test    eax,eax                 ;br if eax == 0
        jz      popc1
popc0:  lea     edx,[eax-1]             ;eax &= (eax-1)
        and     eax,edx
        loopnz  popc0
popc1:  neg     ecx
        lea     eax,[ecx+32]
        ret
popcntx endp


This is an alternative non-looping example:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        mov     edx,ecx                 ;edx = ecx
        shr     edx,1                   ;mov upr 2 bit field bits to lwr
        and     edx,055555555h          ; and mask them
        sub     ecx,edx                 ;ecx = 2 bit field counts
                                        ; 0->0, 1->1, 2->1, 3->1
        mov     eax,ecx
        shr     ecx,02h                 ;mov upr 2 bit field counts to lwr
        and     eax,033333333h          ;eax = lwr 2 bit field counts
        and     ecx,033333333h          ;edx = upr 2 bit field counts
        add     ecx,eax                 ;ecx = 4 bit field counts
        mov     eax,ecx
        shr     eax,04h                 ;mov upr 4 bit field counts to lwr
        add     eax,ecx                 ;eax = 8 bit field counts
        and     eax,00f0f0f0fh          ; after the and
        imul    eax,eax,01010101h       ;eax bit 24->28 = bit count
        shr     eax,018h                ;eax bit 0->4 = bit count
        ret
popcntx endp

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  死守一世寂寞        
                
              
                            
                2020-12-21 11:53
              
            
            
                                                                       
You are starting out with a really inefficient algorithm - if you use a better algorithm then you may not need to waste time with assembler. See Hacker's Delight and/or Bit Twiddling Hacks for much more efficient methods.

Note also that newer x86 CPUs have a POPCNT instruction which does all of the above in one instruction (and you can call it via an intrinsic too, so no need for asm).

And finally gcc has a builtin: __builtin_popcount, which again does all you need - it will use POPCNT on newer CPUs and equivalent asm on older CPUs.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  感动是毒        
                
              
                            
                2020-12-21 11:54
              
            
            
                                                                       
When I needed to create a popcount, I ended up using the 5's and 3's method from the Bit Twiddling Hacks @PaulR mentioned.  But if I wanted to do this with a loop, maybe something like this:

#include <stdio.h>
#include <stdlib.h>

int popcount2(int v) {
   int result = 0;
   int junk;

   asm (
        "shr $1, %[v]      \n\t"   // shift low bit into CF
        "jz done           \n"     // and skip the loop if that was the only set bit
     "start:               \n\t"
        "adc $0, %[result] \n\t"   // add CF (0 or 1) to result
        "shr $1, %[v]      \n\t"
        "jnz start         \n"     // leave the loop after shifting out the last bit
     "done:                \n\t"
        "adc $0, %[result] \n\t"   // and add that last bit

        : [result] "+r" (result), "=r" (junk)
        : [v] "1" (v)
        : "cc"
   );

   return result;
}

int main(int argc, char *argv[])
{
   for (int x=0; x < argc-1; x++)
   {
      int v = atoi(argv[x+1]);

      printf("%d %d\n", v, popcount2(v));
   }
}


adc is almost always more efficient than branching on CF.

"=r" (junk) is a dummy output operand that is in the same register as v (the "1" constraint).  We're using this to tell the compiler that the asm statement destroys the v input.  We could have used [v] "+r"(v) to get a read-write operand, but we don't want the C variable v to be updated.

Note that the loop trip-count for this implementation is the position of the highest set bit.  (bsr, or 32 - clz(v)).  @rcgldr's implementation which clears the lowest set bit every iteration will typically be faster when the number of set bits is low but they're not all near the bottom of the integer.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复