Question
How do I check whether a character is in 0-9, A-Z, or a-z? I understand that you can use cmp char, 'A' or cmp char, '0', etc. However, if I have to check three different ranges, how do I do that?
If I need to check whether 'A' <= C <= 'Z', I would first have to check whether the character value is below 'A', and then whether it's less than or equal to 'Z'. But since 0-9 are below A, how do I account for that without messing up the logic? The same goes for Z, since a-z are above Z. I'm posting the logic I have so far. I feel so dumb for not getting simple stuff, but I'm a beginner and I've been working on this for several days and now I'm having to start over again, so any help would be greatly appreciated.
_asm
{
mov ecx, 127
mov esi, 0
mov ebx,LocalBuffer[esi] ;LocalBuffer is a c++ array
Loop1:
cmp ebx, 'a' ;ebx is the 0'th index value of LocalBuffer
jb notLowercase ;If character value is below 'a'
cmp ebx,'z'
jbe CharCount ;if it's less than or equal to 'z'
cmp ebx,'A'
jb notUpperCase ;If less than 'A', but then won't this discard 0-9?
cmp ebx,'Z'
jb CharCount ;If it's less than 'Z', but what about greater than Z?
cmp ebx,'0'
jb NotDigit ;If less than '0'
cmp ebx,'9'
jb CharCount ;What if it's greater than 9?
notLowerCase:
;DO I LOOP BACK TO LOOP1, MOVE ON TO THE NEXT CHARACTER OR SOMETHING ELSE?
notUpperCase:
;SAME ISSUE AS NotLowerCase
notDigit:
;SAME ISSUE AS LAST 2
CharCount:
;Do something
Answer 1:
First of all, you can't debug your branching until you fix the load (see "How to load a single byte from address in assembly"): you're loading 4 bytes of characters and comparing that whole 32-bit value against 'a' and so on. Use movzx instead of mov ebx, LocalBuffer[esi] because it's a char array.
If you've been single-stepping your code in the debugger, maybe you've noticed that all 4 bytes of ebx are non-zero. That's why your cmp/branches aren't working or doing what you expect.
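To make the difference between the two loads concrete, here is a minimal C++ sketch. The helper names load_dword and load_byte_zx are made up for illustration; they only model what the two instructions read from memory.

```cpp
#include <cstdint>
#include <cstring>

// Models `mov ebx, LocalBuffer[esi]`: 4 consecutive bytes read as one dword.
// (Hypothetical helper, not part of the original code.)
inline std::uint32_t load_dword(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);   // well-defined unaligned 4-byte load
    return v;
}

// Models `movzx ebx, byte ptr LocalBuffer[esi]`: one byte, zero-extended.
inline std::uint32_t load_byte_zx(const char* p) {
    return static_cast<unsigned char>(*p);
}
```

With a buffer that starts with "abcd", load_byte_zx gives exactly 'a', while load_dword gives a value that also contains 'b', 'c' and 'd', so comparing it against 'a' can never be equal.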
@zx485 explained the general case of a chain of branches to go through until you can definitely accept or reject an input.
But you can also simplify with efficient range checks based on the unsigned-compare trick. For example, "Reverse-engineering asm using sub / cmp / setbe back to C? My attempt is compiling to branches" shows how that works for just the lower-case ASCII range.
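As a rough C++ illustration of the unsigned-compare trick (the function name in_range is an assumption for this sketch, not something from the linked answers): subtract the low end of the range, then do a single unsigned compare against the range length.

```cpp
#include <cstdint>

// True when lo <= c <= hi, using one subtract and one unsigned compare.
// Values below lo wrap around to large unsigned numbers, so they fail as well.
constexpr bool in_range(std::uint8_t c, std::uint8_t lo, std::uint8_t hi) {
    return static_cast<std::uint8_t>(c - lo) <= static_cast<std::uint8_t>(hi - lo);
}
```

For example, in_range('`', 'a', 'z') computes 0x60 - 0x61, which wraps to 0xFF as a byte and is rejected by the same compare that rejects '{'.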
Even better, ASCII is conveniently designed so the A-Z and a-z ranges align with each other and don't cross a %32 boundary, so you can force a byte to lower-case with c |= 0x20, or to upper-case with c &= ~0x20. Then you only have one range to check for alphabetic characters.
OR with 20h forces upper-case characters to lower-case and doesn't turn any non-alphabetic character into a lower-case letter, so you can do that on a copy of your register.
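A small C++ sketch of the case-folding idea, assuming plain ASCII input. The names is_ascii_letter, force_lower and force_upper are hypothetical; the fold is done on a temporary so the original byte stays available.

```cpp
#include <cstdint>

// Set bit 5: 'A'..'Z' become 'a'..'z'. Only meaningful for letters.
constexpr std::uint8_t force_lower(std::uint8_t c) {
    return static_cast<std::uint8_t>(c | 0x20);
}

// Clear bit 5: 'a'..'z' become 'A'..'Z'. Only meaningful for letters.
constexpr std::uint8_t force_upper(std::uint8_t c) {
    return static_cast<std::uint8_t>(c & ~0x20);
}

// Fold a copy to lower case, then do one a..z range check on the copy.
constexpr bool is_ascii_letter(std::uint8_t c) {
    return static_cast<std::uint8_t>(force_lower(c) - 'a') <= 'z' - 'a';
}
```

Characters next to the letter ranges, such as '@' and '[', turn into '`' and '{' after the OR, which are still outside a..z, so they are rejected.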
See "What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?" and especially "How to access a char array and change lower case letters to upper case, and vice versa" for MSVC inline asm that loops over a char array and checks for alphabetic or not.
Make sure you don't destroy your only copy, because you still need to count upper-case separately from lower-case; you're just creating a temporary to branch on. If you want to avoid unused positions in your count array, then maybe you want c - 'A' as your array index. But probably not if you have one array for all the characters and digits you want to count.
Example
For the loop structure, I have out-of-range characters jump over the Do Something part, reaching the compare/branch loop condition. The load and index increment happen every iteration, regardless of the loaded character.
Note that every character that's not in any of the ranges is a non-digit and a non-letter. It doesn't make sense to have a non-digit branch target separate from a non-letter branch target because that's not what you're figuring out. You could have digit and letter branch to separate places, though.
_asm
{
xor esi, esi ; i=0
Loop1: ; do {
; load from the array *inside* the loop.
movzx ebx, byte ptr LocalBuffer[esi]
inc esi ; ebx = buf[i++]
; check for digits first
lea eax, [ebx - '0']
cmp al, 9
jbe CharCount ; if (c-'0' <= 9) goto CharCount
; non-digits fall through into checking for alphabetic
mov eax, ebx
or eax, 20h ; force to lower-case
sub eax, 'a' ; subtract start of the range
cmp al, 'z'-'a' ; see if it was inside the length of the range (unsigned)
ja skipCount
; in the common case (alphabetic characters), fall through into CharCount
CharCount:
; EBX still holds the character value, zero-extended
add byte ptr [counts + ebx], 1 ;Do something
; or use [counts + ebx*4] if you have an int array.
skipCount: ; rejected characters jump here, skipping count increment
cmp esi, 127
jb Loop1 ; } while(i<127)
}
You don't need to waste a 2nd register on another loop counter (ECX) when you already have ESI. cmp/jb is more efficient than the loop instruction anyway.
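For readers who find the control flow easier to follow in C++, this is a rough equivalent of the asm loop above. The function name count_alnum and the unsigned counts array are assumptions for the sketch; the asm version uses a byte array named counts.

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the asm loop: load, advance the index, then count or skip.
inline void count_alnum(const char* buf, std::size_t len, unsigned (&counts)[256]) {
    if (len == 0) return;     // the do/while shape assumes at least one element
    std::size_t i = 0;
    do {
        const std::uint8_t c = static_cast<std::uint8_t>(buf[i++]);  // movzx + inc esi
        const bool digit  = static_cast<std::uint8_t>(c - '0') <= 9;
        const bool letter = static_cast<std::uint8_t>((c | 0x20) - 'a') <= 'z' - 'a';
        if (digit || letter) {
            ++counts[c];      // "Do something": indexed by the original byte
        }
    } while (i < len);        // cmp esi, 127 / jb Loop1
}
```

Rejected characters simply skip the increment and fall into the loop condition, which is what the skipCount label does in the asm.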
I think we can save one instruction by doing the subtract first (so we can still use lea to copy-and-subtract), but then we have to clear the 0x20 bit instead of setting it, so we're dealing with upper-case.
;; untested, but I think this is correct, too, using LEA+AND instead of MOV+OR+SUB
lea eax, [ebx - 'A']
and eax, ~20h ; clear the lower-case bit
cmp al, 'Z'-'A' ; 25, same as 'z'-'a' of course.
ja skipCount
c - 'A' = 0x20 for c='a'. Character codes past 'Z' but before 'a' produce smaller results (26 to 31, which don't have the 0x20 bit set), so clearing the 0x20 bit can't give us a false-positive.
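Since the LEA+AND variant is marked untested, one way to gain confidence is to model both variants in C++ and compare them over all 256 byte values. This is only a sketch of the arithmetic, not of the asm itself, and the function names are invented for the purpose.

```cpp
#include <cstdint>

// Variant A: set bit 5 first, then subtract 'a' (MOV + OR + SUB in the asm).
constexpr bool letter_or_sub(std::uint8_t c) {
    return static_cast<std::uint8_t>((c | 0x20) - 'a') <= 25;
}

// Variant B: subtract 'A' first, then clear bit 5 (LEA + AND in the asm).
// The cast models comparing only AL, the low byte of the result.
constexpr bool letter_sub_and(std::uint8_t c) {
    return static_cast<std::uint8_t>((c - 'A') & ~0x20) <= 25;
}

// Compare both variants for every possible byte value.
inline bool variants_agree() {
    for (int c = 0; c < 256; ++c) {
        const std::uint8_t b = static_cast<std::uint8_t>(c);
        if (letter_or_sub(b) != letter_sub_and(b)) return false;
    }
    return true;
}
```

Under this model the two checks accept the same set of bytes, namely A-Z and a-z.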
PS: if this is the same histogram problem you asked previous questions about, you don't need to filter while reading. Just make your array of counts have 256 elements (one for every possible uint8_t value) and then only loop over the ones you want to print.
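A minimal C++ sketch of that count-everything approach follows. The names histogram and report, and the "char=count" output format, are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Count every byte unconditionally; there are no range checks in the loop.
inline void histogram(const char* buf, std::size_t len, unsigned (&counts)[256]) {
    for (std::size_t i = 0; i < len; ++i) {
        ++counts[static_cast<unsigned char>(buf[i])];
    }
}

// Filter only when reporting: digits, upper-case and lower-case letters.
inline std::string report(const unsigned (&counts)[256]) {
    std::string out;
    for (int c = 0; c < 256; ++c) {
        const bool wanted = (c >= '0' && c <= '9') ||
                            (c >= 'A' && c <= 'Z') ||
                            (c >= 'a' && c <= 'z');
        if (wanted && counts[c] != 0) {
            out += static_cast<char>(c);
            out += '=';
            out += std::to_string(counts[c]);
            out += ' ';
        }
    }
    return out;
}
```

The counting loop stays branch-free with respect to the character class, and the filtering cost is paid once over 256 entries instead of once per input byte.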
If you were getting segfaults using ebx as the index, that's because you loaded 4 bytes (a large integer) instead of zero-extending one. We already fixed this bug in previous versions of your question.
Also, as I previously explained in comments, you don't need to copy your string input to a LocalBuffer. Just do char *bufptr = Buffer; and in inline asm do mov esi, bufptr to get that pointer into a register. That's inefficient, but much better than copying a whole array. The same goes for counts.
Or https://godbolt.org/z/QszVMf shows how to access class members from inline asm.
Answer 2:
An easy approach is to order the ranges in an ascending (or descending) way. Then you can use the cmps in an ON/OFF style:
mov ecx, 127 ; Check a 127 char string
mov esi, 0
Loop1:
movzx ebx, byte ptr LocalBuffer[esi] ; Load a byte from the address
cmp bl, '0' ; '0' = 48 - all lower values are NOT IN THE SET
jb notInSet ;
cmp bl,'9' ; '9' = 57 - all lower are IN THE SET
jbe CharCount ; It is a number
cmp bl,'A' ; 'A' = 65 - all lower are NOT IN THE SET
jb notInSet ; If less than 'A'
cmp bl,'Z' ; 'Z' = 90 - all lower are IN THE SET
jbe CharCount ; It is an uppercase char
cmp bl,'a' ; 'a' = 97 - all lower are NOT IN THE SET
jb notInSet ;
cmp bl,'z' ; 'z' = 122 - all lower are IN THE SET
jbe CharCount ; It is a lowercase letter
; FALL THROUGH for greater values
notInSet:
inc esi
loop Loop1
jmp Final
CharCount:
; DO SOMETHING (that doesn't mess up ECX, ESI)
inc esi
loop Loop1
; FALL THROUGH to Final
Final:
; END of this snippet
As you can see, the values that are checked ascend. For example, the character '3' (=51) is first checked for being below 48 (=NO), then for being below or equal to 57 (=YES), so the second jump is taken.
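The same ascending chain can be written in C++ as a series of early returns. The function name in_set_chain is made up for this sketch.

```cpp
// Walk the boundaries in ascending order; each test flips between
// "not in the set" and "in the set", like the cmp/jb/jbe chain.
inline bool in_set_chain(unsigned char c) {
    if (c <  '0') return false;   // below '0': rejected
    if (c <= '9') return true;    // '0'..'9'
    if (c <  'A') return false;   // ':'..'@'
    if (c <= 'Z') return true;    // 'A'..'Z'
    if (c <  'a') return false;   // '['..'`'
    if (c <= 'z') return true;    // 'a'..'z'
    return false;                 // '{' and above
}
```

Because each test only runs when all earlier ones failed to decide, every comparison needs just one bound instead of two.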
An alternative is using a lookup table with indexed addressing (called JumpTable below). In this approach you define the ranges as a table of boolean values (0=NotInSet, 1=CharCount).
The table should be set up like this in the .data segment for your scenario (note the alternating values of 0 and 1, the ON/OFF style mentioned above):
.data
JumpTable db 48 dup(0), 10 dup(1), 7 dup(0), 26 dup(1), 6 dup(0), 26 dup(1), 133 dup(0)
Then the code could look like this:
mov ecx, 127
mov esi, 0
Loop1:
movzx ebx, byte ptr LocalBuffer[esi] ; Load a byte from the address
movzx eax, byte ptr JumpTable[ebx] ; Retrieve the ebx'th value of the table
test eax, eax ; Check if it's zero
jnz CharCount ; If it's not, it's a char, so jump to CharCount
; FALL THROUGH TO notInSet
notInSet:
inc esi
loop Loop1
jmp Final
CharCount:
; DO SOMETHING (that doesn't mess up ECX, ESI)
inc esi
loop Loop1
; FALL THROUGH to Final
Final:
; END of this snippet
The table has 256 entries, one for every possible byte value, each either 0 or 1.
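Hand-counted run lengths are easy to get wrong, so here is a C++ sketch that builds the same kind of table from the character ranges and lets you read the run lengths back. The names make_table, in_set_table and run_length are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

// Build the 256-entry membership table from the ranges themselves.
inline std::array<std::uint8_t, 256> make_table() {
    std::array<std::uint8_t, 256> t{};             // all zero: not in the set
    for (int c = '0'; c <= '9'; ++c) t[c] = 1;
    for (int c = 'A'; c <= 'Z'; ++c) t[c] = 1;
    for (int c = 'a'; c <= 'z'; ++c) t[c] = 1;
    return t;
}

// One indexed load replaces the whole compare chain.
inline bool in_set_table(const std::array<std::uint8_t, 256>& t, unsigned char c) {
    return t[c] != 0;
}

// Length of the run of equal flags that starts at index `start`.
inline int run_length(const std::array<std::uint8_t, 256>& t, int start) {
    int n = 0;
    while (start + n < 256 && t[start + n] == t[start]) ++n;
    return n;
}
```

Reading the runs back from index 0 gives 48, 10, 7, 26, 6, 26 and 133 entries, which add up to 256.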
In both cases you can move the inc esi to the beginning, right after the value has been read by movzx ebx, byte ptr LocalBuffer[esi].
Source: https://stackoverflow.com/questions/58791457/how-do-i-check-whether-characters-are-within-certain-ascii-value-ranges