Converting Uppercase to Lowercase in Assembly issue

Question

I'm writing a procedure to convert a pre-set string from uppercase to lowercase. I'm currently moving the byte at the address into an 8-bit register, then testing the ASCII value in a very sloppy way to see if it's uppercase. Is there a cleaner way to go about it?

Right now I'm subtracting 65 from the ASCII value and comparing to 25. Since uppercase is ASCII (dec) 65-90, any uppercase letters will result in 0-25.

    .DATA
string  DB   "ATest This String?.,/[}", '$'
strSize DD  23
.CODE
strToLower  PROC
        LEA     EAX, string
        PUSH    EAX
        CALL    toLower2    ; write toLower2
        POP EAX
        LEA EAX, string     ; return char* to C++
        RET
strToLower  ENDP

;---------------------------------------------
;Procedure: Convert to LowerCase
;Input: Address in EBX
;       unsigned in AL for each letter
;Output: EAX will contain new string
;---------------------------------------------

toLower2    PROC    ;65-90 is upper, 97-122 is lower (XOR 32?)
            LEA EBX, string
            MOVE ECX, strSize
            PUSH AL     ; PUSH AL before manipulating it
loop1:      MOV AL, [EBX]   ; Put char into AL to manipulate
            XOR BL, BL          ;?????????????
            MOV BL, AL          ;Set condition here???
            SUB BL, 65          ;?????????????
            CMP BL, 25          ;if(i > 64 && < 91) i += 32;
            JA  NoCap           ;
            ADD AL, 32          ;Adds 32 to ASCII value, making lower 
NoCap:      MOV [EBX], AL
            INC EBX
            LOOP loop1
            POP AL      ;Replace/POP AL
            LEA EAX, string
toLower2    ENDP
            END

Answer 1:

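SUB followed by an unsigned compare is a good way to check whether an input is within a certain range using only one conditional branch, instead of separate compare-and-branches for >= 'A' and <= 'Z'. Anything below 'A' wraps around to a large unsigned value after the subtraction, so a single JA rejects both sides of the range.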
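
For readers more at home in C, the same range check can be sketched like this. The name is_ascii_upper is hypothetical (it is not part of the code in the question), and the sketch assumes plain ASCII input:

```c
#include <stdbool.h>

/* Hypothetical helper: true when c is an ASCII uppercase letter.
   Subtracting 'A' maps 'A'..'Z' to 0..25. Anything below 'A' wraps
   around to a large unsigned value, so one unsigned compare covers
   both the lower and the upper bound. */
static bool is_ascii_upper(unsigned char c)
{
    return (unsigned char)(c - 'A') <= 25u;
}
```

An optimizing compiler will often turn a two-sided test like c >= 'A' && c <= 'Z' into this form on its own, but that depends on the compiler and its settings.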

Compilers use this trick when possible. See also Agner Fog's Optimizing Assembly guide, and other links in the x86 tag wiki for more stuff about writing efficient asm.

You can even use it to detect alphabetic characters (lower or upper case) with one branch: OR with 0x20 will make any upper-case character lower-case, but won't make any non-alphabetic characters alphabetic. So do that, then use the unsigned-compare trick to check for being in the lower-case range. (Or start with AND with ~0x20 to clear that bit, forcing upper-case). I used this trick in an answer on flipping the case of alphabetic characters while leaving other characters alone.
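
A minimal C sketch of that one-branch alphabetic test, assuming ASCII input; is_ascii_alpha is a hypothetical name used only for illustration:

```c
#include <stdbool.h>

/* Hypothetical helper: true when c is an ASCII letter of either case.
   OR with 0x20 maps 'A'..'Z' onto 'a'..'z' and never turns a
   non-letter into a letter, so one range check on the result is enough. */
static bool is_ascii_alpha(unsigned char c)
{
    unsigned char lowered = (unsigned char)(c | 0x20);  /* force bit 5 on */
    return (unsigned char)(lowered - 'a') <= 25u;       /* unsigned range check */
}
```

The characters adjacent to the letter ranges ('@', '[', '`', '{') are the ones worth checking by hand: after the OR they land on '`' or '{', just outside 'a'..'z'.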

And yes, as you noticed, ASCII is designed so the difference between upper/lower case for every letter is just flipping one bit. Every lowercase character has 0x20 set, while uppercase has it cleared. AND/OR/XOR are typically preferable for doing this (vs. ADD/SUB), because you can sometimes take advantage of not caring about the initial state, when forcing to one case.
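
The three bit operations can be sketched in C as follows. The function names are hypothetical, and each one is only meaningful when the input is already known to be an ASCII letter:

```c
/* Hypothetical helpers; valid only for ASCII letters.
   Bit 5 (0x20) is the only difference between 'A'..'Z' and 'a'..'z'. */

/* OR sets the bit: the result is lowercase whatever the input case was. */
static unsigned char force_lower(unsigned char c)
{
    return (unsigned char)(c | 0x20);
}

/* AND with the inverted mask clears the bit: the result is uppercase. */
static unsigned char force_upper(unsigned char c)
{
    return (unsigned char)(c & ~0x20);
}

/* XOR toggles the bit: the case is flipped. */
static unsigned char flip_case(unsigned char c)
{
    return (unsigned char)(c ^ 0x20);
}
```

Note that force_lower and force_upper give the right answer without first looking at the current case, which is the "not caring about the initial state" advantage over ADD/SUB.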

Your code has some weird stuff: PUSH AL doesn't even assemble with most assemblers, since the minimum size for push/pop is 16 bits. There's also no point to saving/restoring AL, because you clobber the whole of EAX right after restoring AL after the loop!

Also, MOV just overwrites its destination, so there's no need to xor bl,bl.

Also, you use BL as a scratch register, but it's the low byte of EBX, which you use as the pointer! MOV BL, AL and SUB BL, 65 corrupt the address before you store through [EBX].

Here's how I might do it, using only EAX, ECX and EDX so I don't have to save/restore any registers. (Your function clobbers EBX, which most 32- and 64-bit calling conventions require functions to save/restore.) Because string is statically allocated, I can use its address as an immediate constant; otherwise I'd need an extra register to hold the start pointer.

toLower2    PROC    ;65-90 is upper, 97-122 is lower (XOR 32?)
            mov   edx, OFFSET string   ; don't need LEA for this, and mov is slightly more efficient
            add   edx, strSize         ; This should really be an equ definition, not a load from memory.

            ; edx starts at one-past-the-end, and we loop back to the start
loop1:
            dec   edx
            movzx eax, byte ptr [edx]  ; mov al, [edx] (leaving high garbage in EAX) would also work,
                                       ; but movzx avoids a partial-register stall when LEA reads EAX
            lea   ecx, [eax - 'A']     ; cl = al-'A', and we don't care about the rest of the register

            cmp   cl, 25               ; if(c >= 'A' && c <= 'Z') c |= 0x20;
            ja    NoCap
            or    al, 20h              ; tolower
            mov   [edx], al            ; since we're branching anyway, make the store conditional
NoCap:
            cmp   edx, OFFSET string
            ja    loop1

            mov   eax, edx             ; edx == OFFSET string here: return the char*
            ret
toLower2    ENDP

The LOOP instruction is slow, and should be avoided. Just forget it even exists and use whatever loop condition is convenient.

Storing only when the character actually changes makes the code more efficient: if there's nothing to do, the function won't dirty cache lines for memory that hasn't been modified in a while.
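
As a rough C counterpart of the backward walk with a conditional store, something like the sketch below could be used. to_lower_in_place is a hypothetical name, ASCII input is assumed, and unlike the asm loop above it uses a while loop, so a zero-length string is handled without touching any memory:

```c
#include <stddef.h>

/* Hypothetical sketch: lowercase len bytes starting at s, in place.
   Walks from one past the end back to the start and writes only the
   bytes that actually change, so untouched memory stays clean. */
static char *to_lower_in_place(char *s, size_t len)
{
    char *p = s + len;                        /* one past the end */
    while (p > s) {
        --p;
        unsigned char c = (unsigned char)*p;
        if ((unsigned char)(c - 'A') <= 25u)  /* 'A'..'Z' only */
            *p = (char)(c | 0x20);            /* conditional store */
    }
    return s;                                 /* same pointer back to the caller */
}
```

Called on a writable buffer holding the string from the question, this leaves the punctuation alone and lowercases the letters.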

Instead of ja NoCap, you could do that branchlessly with a cmov. But now I have to ignore my own suggestion to prefer AND/OR over ADD/SUB, because we can use LEA to add 0x20 without affecting flags, saving us a register.

loop1:
            dec   edx
            movzx eax, byte ptr [edx]  ; mov al, [edx] (leaving high garbage in EAX) would also work,
                                       ; but movzx avoids a partial-register stall when LEA reads EAX
            lea   ecx, [eax - 'A']     ; cl = al-'A', and we don't care about the rest of the register

            cmp   cl, 25               ; if(c >= 'A' && c <= 'Z') c += 0x20;
            lea   ecx, [eax + 20h]     ; without affecting flags
            cmovna eax, ecx            ; take the +0x20 version if it was in the uppercase range to start with
            ; al = tolower(al)

            mov   [edx], al            ; unconditional store this time
            cmp   edx, OFFSET string
            ja    loop1
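
The select-between-two-values idea behind the cmov version can be written in C roughly as below. to_lower_select is a hypothetical name, and whether a compiler actually emits cmov (rather than a branch) for it depends on the compiler and optimization settings, so treat this as an illustration of the logic only:

```c
/* Hypothetical sketch of the cmov logic: compute the +0x20 candidate
   unconditionally, then pick it only if c was in 'A'..'Z'. */
static unsigned char to_lower_select(unsigned char c)
{
    unsigned char lowered = (unsigned char)(c + 0x20);   /* like lea ecx, [eax + 20h] */
    return ((unsigned char)(c - 'A') <= 25u) ? lowered   /* like cmovna eax, ecx */
                                             : c;
}
```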

Source: https://stackoverflow.com/questions/40366031/converting-uppercase-to-lowercase-in-assembly-issue

Tags

assembly

x86

MASM

lowercase