How do I check whether characters are within certain ASCII value ranges?

Question

How do I check whether a character is in 0-9, A-Z, or a-z? I understand that you can use cmp char, 'A' or cmp char, '0', etc. However, if I have to check three different ranges, how do I do that?

If I need to check whether 'A' <= C <= 'Z', then I would have to check whether the character value is below 'A' first, and then whether it's less than or equal to 'Z'. But since 0-9 are below 'A', how do I account for that without messing up the logic? The same goes for 'Z', since a-z are above 'Z'. I'm posting the logic I have so far. I feel so dumb for not getting simple stuff, but I'm a beginner and I've been working on this for several days and now I'm having to start over again, so any help would be greatly appreciated.

_asm
{
   mov ecx, 127
   mov esi, 0
   mov ebx,LocalBuffer[esi] ;LocalBuffer is a c++ array 

Loop1:
   cmp ebx, 'a'     ;ebx is the 0'th index value of LocalBuffer
   jb notLowercase  ;If character value is below 'a'
   cmp ebx,'z'
   jbe CharCount    ;if it's less than or equal to 'z' 
   cmp ebx,'A'
   jb notUpperCase ;If less than 'A', but then won't this discard 0-9?
   cmp ebx,'Z'
   jb CharCount    ;If it's less than 'Z', but what about greater than Z?
   cmp ebx,'0'
   jb NotDigit     ;If less than '0'
   cmp ebx,'9'
   jb CharCount    ;What if it's greater than 9?


notLowerCase:  
;DO I LOOP BACK TO LOOP1, MOVE ON TO THE NEXT CHARACTER OR SOMETHING ELSE? 

notUpperCase:
;SAME ISSUE AS NotLowerCase

notDigit:
;SAME ISSUE AS LAST 2

CharCount:
;Do something

Answer 1:

First of all, you can't debug your branching until you fix the load itself (see "How to load a single byte from address in assembly"): you're loading 4 bytes of characters and comparing that whole 32-bit value against 'a' and so on. Because LocalBuffer is a char array, use movzx ebx, byte ptr LocalBuffer[esi] instead of mov ebx, LocalBuffer[esi].

If you've been single-stepping your code in the debugger, maybe you've noticed that all 4 bytes of ebx are non-zero. That's why your cmp/branches aren't working or doing what you expect.
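
To make the difference between the two loads concrete, here is a rough C sketch (not part of the original code). The function names load_dword and load_byte_zx are made up for illustration; memcpy stands in for the 4-byte mov, and the cast stands in for movzx.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Roughly what `mov ebx, LocalBuffer[esi]` does: it reads 4 adjacent
   chars as one 32-bit value (hypothetical helper name). */
static uint32_t load_dword(const char *buf, size_t i) {
    uint32_t v;
    memcpy(&v, buf + i, sizeof v);
    return v;
}

/* Roughly what `movzx ebx, byte ptr LocalBuffer[esi]` does: it reads
   one byte and zero-extends it to 32 bits. */
static uint32_t load_byte_zx(const char *buf, size_t i) {
    return (unsigned char)buf[i];
}
```

With a buffer holding "abcd", load_byte_zx(buf, 0) == 'a' holds, while load_dword(buf, 0) == 'a' does not, because the other three characters end up in the upper bytes of the value.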

@zx485 explained the general case of a chain of branches to go through until you can definitely accept or reject an input.

But you can also simplify with efficient range checks that use the unsigned-compare trick. For example, "Reverse-engineering asm using sub / cmp / setbe back to C? My attempt is compiling to branches" shows how that works for just the lower-case ASCII range.
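
As a minimal sketch of the unsigned-compare trick in C (the helper name in_range_u8 is an assumption, not something from the linked post): subtract the low end of the range, then do one unsigned compare against the length of the range.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

/* One compare instead of two: if c < lo, the subtraction wraps around
   to a large unsigned value, so a single "<=" rejects values on both
   sides of the range. */
static int in_range_u8(uint8_t c, uint8_t lo, uint8_t hi) {
    return (uint8_t)(c - lo) <= (uint8_t)(hi - lo);
}
```

For example, in_range_u8('5', '0', '9') is true, and in_range_u8('0', 'a', 'z') is false because '0' - 'a' wraps to 207 as an 8-bit value, which is above 25.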

Even better, ASCII is conveniently designed so the A-Z and a-z ranges align with each other and don't cross a %32 boundary, so you can force a byte to lower case with c |= 0x20, or to upper case with c &= ~0x20. Then you only have one range to check for alphabetic characters.

OR with 20h forces upper-case characters to lower-case, and doesn't make any non-alphabetic characters into lowercase, so you can do that on a copy of your register.
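
A small C sketch of that idea, assuming plain ASCII input (the name is_ascii_alpha is hypothetical): set the 0x20 bit on a temporary copy, then do a single unsigned range check against 'a'..'z'.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

/* Alphabetic check with one range test. */
static int is_ascii_alpha(uint8_t c) {
    uint8_t lower = c | 0x20;   /* temporary copy; c itself is untouched */
    return (uint8_t)(lower - 'a') <= 'z' - 'a';
}
```

Characters such as '@' and '[' become '`' and '{' after the OR, which sit just outside 'a'..'z', so they are still rejected.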

See "What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?" and especially "How to access a char array and change lower case letters to upper case, and vice versa" for MSVC inline asm that loops over a char array and checks for alphabetic or not.

Make sure you don't destroy your only copy because you still need to count upper separately from lower; you're just creating a temporary to branch on. Unless you want to avoid unused positions in your count array, then maybe you want c - 'A' as your array index. But probably not if you have one array for all characters and digits you want to count.

Example

For the loop structure, I have out-of-range characters jump over the Do Something part, reaching the compare/branch loop condition. The load and index increment happen every iteration, regardless of the loaded character.

Note that every character that's not in any of the ranges is a non-digit and a non-letter. It doesn't make sense to have a non-digit branch target separate from a non-letter branch target because that's not what you're figuring out. You could have digit and letter branch to separate places, though.

_asm
{
   xor  esi, esi   ; i=0

Loop1:                           ; do {
   ; load from the array *inside* the loop.
   movzx ebx, byte ptr LocalBuffer[esi]
   inc   esi                          ; ebx = buf[i++]

 ; check for digits first
   lea   eax, [ebx - '0']
   cmp   al, 9
   jbe   CharCount                    ; if (c-'0' <= 9) goto CharCount
 ; non-digits fall through into checking for alphabetic

   mov   eax, ebx
   or    eax, 20h       ; force to lower-case
   sub   eax, 'a'       ; subtract start of the range
   cmp   al, 'z'-'a'    ; see if it was inside the length of the range (unsigned)
   ja    skipCount
; in the common case (alphabetic characters), fall through into CharCount

CharCount:
; EBX still holds the character value, zero-extended
   add  byte ptr [counts + ebx], 1       ;Do something
    ; or use  [counts + ebx*4] if you have an int array.

skipCount:    ; rejected characters jump here, skipping count increment
   cmp  esi, 127
   jb   Loop1               ; } while(i<127)
}

You don't need to waste a 2nd register on another loop counter (ECX) when you already have ESI. cmp/jb is more efficient than the loop instruction anyway.

I think we can save one instruction by doing the subtract first (so we can still use lea to copy-and-subtract), but then we have to clear the 0x20 bit instead of setting it so we're dealing with upper-case.

;; untested, but I think this is correct, too, using LEA+AND instead of MOV+OR+SUB
   lea   eax, [ebx - 'A']
   and   eax, ~20h        ; clear the lower-case bit
   cmp   al, 'Z'-'A'      ; 25, same as 'z'-'a' of course.
   ja    skipCount

c - 'A' = 0x20 for c='a'. Character codes past 'Z' but before 'a' produce smaller results so clearing the 0x20 bit can't give us a false-positive.
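
Since the LEA+AND variant is marked untested, here is one way to check the arithmetic in C, modelling only the low byte (AL) that the cmp looks at. The function name is made up for this sketch.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

/* 8-bit model of: lea eax,[ebx-'A'] / and eax,~20h / cmp al,'Z'-'A' */
static int is_ascii_alpha_sub_and(uint8_t c) {
    uint8_t t = (uint8_t)(c - 'A');   /* subtract the start of the range */
    t &= (uint8_t)~0x20;              /* clear the lower-case bit */
    return t <= 'Z' - 'A';            /* unsigned compare against 25 */
}
```

Looping over all 256 byte values and comparing against the two explicit ranges A-Z and a-z is a cheap way to confirm there are no false positives.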

PS: if this is the same histogram problem you asked previous questions about, you don't need to filter while reading, just make your array of counts have 256 elements (for every possible uint8_t value) and then only loop over the ones you want to print.
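
In C terms, the unfiltered histogram looks roughly like this. It is a sketch under the assumption that the caller zeroes the counts array first; the name count_bytes is hypothetical.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stddef.h>

/* Count every byte value without branching on the character.
   counts must have 256 elements and be zeroed by the caller. */
static void count_bytes(const char *buf, size_t len, unsigned counts[256]) {
    for (size_t i = 0; i < len; i++)
        counts[(unsigned char)buf[i]]++;   /* zero-extended index, like movzx */
}
```

Printing then only needs to walk counts['0'..'9'], counts['A'..'Z'] and counts['a'..'z'].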

If you were getting segfaults using ebx as the index, that's because you loaded 4 bytes (a large integer) instead of zero-extending one. We already fixed this bug in previous versions of your question.

Also, as I previously explained in comments, you don't need to copy your string input to a LocalBuffer; just do char *bufptr = Buffer; and in inline asm do mov esi, bufptr to get that pointer into a register. That's inefficient, but much better than copying a whole array. The same goes for counts.

Or https://godbolt.org/z/QszVMf shows how to access class members from inline asm.

Answer 2:

An easy approach is to order the ranges in an ascending (or descending) way. Then you can use the cmps in an ON/OFF style:

   mov ecx, 127    ; Check a 127 char string
   mov esi, 0
Loop1:
   movzx ebx, byte ptr LocalBuffer[esi]   ; Load a byte from the address  
   cmp bl, '0'     ; '0' = 48 - all lower values are NOT IN THE SET
   jb  notInSet    ; 
   cmp bl,'9'      ; '9' = 57 - all lower are IN THE SET
   jbe CharCount   ; It is a number 
   cmp bl,'A'      ; 'A' = 65 - all lower are NOT IN THE SET
   jb  notInSet    ; If less than 'A'
   cmp bl,'Z'      ; 'Z' = 90 - all lower are IN THE SET
   jbe CharCount   ; It is an uppercase char
   cmp bl,'a'      ; 'a' = 97 - all lower are NOT IN THE SET
   jb  notInSet    ; 
   cmp bl,'z'      ; 'z' = 122 - all lower are IN THE SET
   jbe CharCount   ; It is a lowercase letter
   ; FALL THROUGH for greater values
notInSet:  
   inc esi
   loop Loop1
   jmp Final

CharCount:
   ; DO SOMETHING (that doesn't mess up ECX, ESI)
   inc esi
   loop Loop1
   ; FALL THROUGH to Final

Final:
   ; END of this snippet

As you can see, the checked values ascend. For example, the character '3' (= 51) is first checked against 48 (not below, so no jump), then against 57 (below or equal, so the second jump is taken).
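
The same ascending chain, written as a C sketch for comparison (the function name in_set_chain is an assumption). Each test mirrors one cmp/jb or cmp/jbe pair.

```c
#include <assert.h>   /* only needed for the checks in the note below */

/* Ascending ON/OFF chain: each boundary flips between "not in set"
   and "in set". */
static int in_set_chain(unsigned char c) {
    if (c <  '0') return 0;   /* below '0': not in the set */
    if (c <= '9') return 1;   /* '0'..'9': digit */
    if (c <  'A') return 0;   /* between '9' and 'A' */
    if (c <= 'Z') return 1;   /* 'A'..'Z': upper case */
    if (c <  'a') return 0;   /* between 'Z' and 'a' */
    if (c <= 'z') return 1;   /* 'a'..'z': lower case */
    return 0;                 /* above 'z' */
}
```

Because the boundaries are visited in ascending order, each test only has to handle values that survived all the earlier ones.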

An alternative is using a lookup table (called JumpTable in the code below) with indexed addressing. In this approach you define the ranges as a table of boolean values (0=NotInSet, 1=CharCount):

The table should be set up like this in the .data segment for your scenario (note the alternating values of 0 and 1, the ON/OFF style mentioned above):

.data
  ; 256 entries: 0-47, '0'-'9', 58-64, 'A'-'Z', 91-96, 'a'-'z', 123-255
  JumpTable db 48 dup(0), 10 dup(1), 7 dup(0), 26 dup(1), 6 dup(0), 26 dup(1), 133 dup(0)

Then the code could look like this:

   mov ecx, 127
   mov esi, 0
Loop1:
   movzx ebx, byte ptr LocalBuffer[esi]   ; Load a byte from the address  
   movzx eax, byte ptr JumpTable[ebx]     ; Retrieve the ebx'th value of the table
   test eax, eax    ; Check if it's zero
   jnz  CharCount   ; If it's not, it's a char, so jump to CharCount 
   ; FALL THROUGH TO notInSet
notInSet:  
   inc esi
   loop Loop1
   jmp Final

CharCount:
   ; DO SOMETHING (that doesn't mess up ECX, ESI)
   inc esi
   loop Loop1
   ; FALL THROUGH to Final

Final:
   ; END of this snippet

The table has 256 entries, one for every possible byte value (the ASCII range plus the upper 128 values), each either 0 or 1.
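
Run lengths in a hand-written table are easy to miscount, so one option is to generate or cross-check the table from the range endpoints. A rough C sketch, with hypothetical function names:

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>
#include <string.h>

/* Fill a 256-entry table: 1 for '0'-'9', 'A'-'Z', 'a'-'z', else 0. */
static void build_alnum_table(uint8_t table[256]) {
    memset(table, 0, 256);
    for (int c = '0'; c <= '9'; c++) table[c] = 1;
    for (int c = 'A'; c <= 'Z'; c++) table[c] = 1;
    for (int c = 'a'; c <= 'z'; c++) table[c] = 1;
}

/* Number of entries set to 1; 10 + 26 + 26 = 62 for this set. */
static int count_set_entries(const uint8_t table[256]) {
    int n = 0;
    for (int i = 0; i < 256; i++) n += table[i];
    return n;
}
```

A quick sanity check is that exactly 62 entries are set and that the neighbours of each range ('/', ':', '@', '[', '`', '{') are all 0.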

In both cases you can move the inc esi to the beginning, right after the value has been read by movzx ebx, byte ptr LocalBuffer[esi].

Source: https://stackoverflow.com/questions/58791457/how-do-i-check-whether-characters-are-within-certain-ascii-value-ranges

Tags

assembly

x86

ascii

MASM

range-checking