Implementing a matcher for the regex '[ab][^r]+r' in assembly

名媛妹妹

I need help with my assembly code. I need to write code that finds the range matching my regular expression.

My regex: [ab][^r]+r, so first I'm loo

2 Answers

一个人的身影

2021-01-21 20:40
I realize this doesn't qualify as an 'answer', since the assignment requires you to use the specific format provided by your instructor. However, since I feel that using inline asm is a poor way to learn asm, I want to show how this would look if you wrote it as pure asm. Trying to cram this into the other (already very long) answer seems like a poor fit.

Instead I propose 2 files. The first is pure C code:
```
#include <stdio.h>

extern int __cdecl DoRegEx(const char *s, int *startpos);

int main(void) 
{
  const char *s = "fqr  b qabxx  xryc pqr";
  int startpos, len;

  len = DoRegEx(s, &startpos);

  printf("%d, %d\n", startpos, len);
  return 0; 
}
```
That's much easier to read/maintain than what you end up with using inline asm. But more importantly, here's the asm file:
```
# foo2.s - Searches for regex "[ab][^r]+r" in string
#
# Called from c with:
#
#    extern int __cdecl DoRegEx(const char *s, int *startpos);
#
# On input:
#
#   [esp+4] is s
#   [esp+8] is pointer to startpos.
#
# On output:
#
#   startpos is the (zero based) offset into (s) where match begins.
#   Length of match (or -1 if match not found) is returned in eax.
#
# __cdecl allows the callee (that's us) to modify any of EAX, ECX, 
# and EDX. All other registers must be returned unchanged.
#

# Use intel syntax
.intel_syntax noprefix

# export our symbol (note __cdecl prepends an underscore to names).
.global _DoRegEx

# Start code segment
.text

_DoRegEx:
   mov ecx, [esp+4] # Load pointer to (s)

Phase1:
   mov dl, [ecx]    # Read next byte

   test dl, dl 
   jz NotFound      # Hit end of string

   inc ecx          # Move to next byte

   cmp dl, 'a'      # Check for 'a'
   je Phase2

   cmp dl, 'b'      # Check for 'b'
   jne Phase1

   ... blah blah blah ...

   mov edx, [esp+8]          # get pointer to startpos
   mov DWORD PTR [edx], ecx  # write startpos

   ret
```
You can compile+link both files at once using gcc -m32 -o foo.exe foo1.c foo2.s.

If you end up working with assembler for a living, it's more likely to look like this than what you see using gcc's extended asm (which is ugly at the best of times). It also deals with common real-world concepts like reading parameters from the stack, preserving registers and using assembler directives (.text, .global, etc). Those things are mostly hidden from you when inlining this into C, but are essential components of working in and understanding assembly language.

FWIW.

PS Did you get your code working? If the other answer gave sufficient information to create your program, don't forget to 'accept' it. If you are stuck again, edit your original post to add your current code, and include a description of what still doesn't work right.
小蘑菇

2021-01-21 20:52
First of all, you are a bit unclear about your requirements. When I first read your post, it looked like you were trying to write a complete "regex" routine in assembler (<blech>). But looking closer, it appears that all you are really doing is "hardcoding" this one, very specific regex search in assembler. This misunderstanding is probably why this question isn't getting any responses.

Second, you should talk to this guy who is apparently in the same class as you. Perhaps you two can share notes.

Third, someone should talk to your instructor about his assignments. Using gcc's "inline asm" to teach asm is probably the hardest possible approach. Does he hate his students? And I'm not impressed with the "template" that he provides which (apparently?) you are not permitted to change. I can see at least a half dozen things I'd change here.

Fourth, you say that the regex string "[ab][^r]+r" should print out 5, 10 for "fqr  b qabxx  xryc pqr". I'm not sure how that can be. The match does start at (zero-based) 5, but it does not end at position 10:
```
          1         2
0123456789012345678901
fqr  b qabxx  xryc pqr
     ^         ^
   start      end
```
The end is position 15. The matching string ("b qabxx  xr", with two spaces before the final "xr") is 11 characters long, so apparently you aren't looking for the length. There is a second 'start' that occurs at position 8, a third at position 9, and there are multiple possible endpoints as well. None of which explains where you are supposed to get "10". I'm going to assume you meant 5, 15.
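If you want to double-check those numbers yourself, here is a minimal sketch that asks an existing regex engine for the first match. It assumes a system that ships POSIX <regex.h> (Linux, macOS, Cygwin; a plain MinGW install may not have it). The function name posix_first_match is made up for this illustration and is not part of the assignment.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the assignment): finds the first match of
 * `pattern` in `s` using POSIX regcomp/regexec.
 * On success, writes the zero-based start offset to *startpos and returns
 * the match length. Returns -1 if nothing matches or the pattern does not
 * compile. */
int posix_first_match(const char *pattern, const char *s, int *startpos)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    assert(pattern != NULL && s != NULL && startpos != NULL);

    /* REG_EXTENDED is needed so that '+' means "one or more". */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, s, 1, &m, 0) == 0) {
        /* rm_so is the first matched byte, rm_eo is one past the last. */
        *startpos = (int)m.rm_so;
        len = (int)(m.rm_eo - m.rm_so);
    }

    regfree(&re);
    return len;
}
```

Called with the pattern "[ab][^r]+r" and the test string from the question, this should report a start of 5 and a length of 11. That makes the last matched character position 15 (5 + 11 - 1), not 10.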

All that said, processing [ab][^r]+r breaks down into essentially 3 parts:
1. [ab]: Find either 'a' or 'b'. Exit with error upon encountering end of string if you can't find them.
2. [^r]+: Look at the character immediately after (1). If it is 'r', go back to 1. If it is the end of the string, exit with error.
3. r: Walk the rest of the string and exit with success on the next 'r', or exit with error at end of string.
If you don't understand why these are the parts, try playing with https://regex101.com/r/E3nI1F/1 (it lets you try various regex search strings to see what it finds).
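Before translating this to assembler, it may help to see the three parts written as plain C. This is only a sketch of the breakdown above, not the assignment's template: the function name match_ab_notr_r and its return convention (length of the match, or -1 if not found) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; hardcodes the regex [ab][^r]+r.
 * Writes the zero-based start of the match to *startpos and returns the
 * match length, or returns -1 if there is no match. */
int match_ab_notr_r(const char *s, int *startpos)
{
    const char *p = s;

    assert(s != NULL && startpos != NULL);

    for (;;) {
        /* Part 1: find 'a' or 'b'. */
        while (*p != '\0' && *p != 'a' && *p != 'b')
            p++;
        if (*p == '\0')
            return -1;              /* end of string, no [ab] */

        /* Part 2: the next character must exist and must not be 'r'. */
        if (p[1] == '\0')
            return -1;              /* nothing after [ab] */
        if (p[1] == 'r') {
            p++;                    /* resume part 1 at the 'r' */
            continue;
        }

        /* Part 3: walk forward to the next 'r'. */
        const char *q = p + 2;
        while (*q != '\0' && *q != 'r')
            q++;
        if (*q == '\0')
            return -1;              /* no closing 'r' anywhere */

        *startpos = (int)(p - s);
        return (int)(q - p) + 1;    /* length includes [ab] and the 'r' */
    }
}
```

Note that failing in part 3 can return immediately. If there is no 'r' after the current [ab], there is none after any later [ab] either, so there is no point going back to part 1.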

Looking at your current code, I don't think you handle either (2) or (3) correctly (actually, I don't think you handle them at all). While there are other things I would change in your code, perhaps tuning should wait until the thing works correctly.

Given that this is homework, I'm not keen on just posting my code. If you just copy/paste my work, you aren't going to learn anything.

If you want to edit your question once you have added the work for 2 and 3, I could review this again or give more suggestions. If you post a working copy, I can share mine and we can compare them.

----------- Edit 1 --------------

"my teacher does not seem to hate us"

Oh? Consider this code (a simplified version of yours):
```
asm volatile (
   "xor %0, %0;"
   "mov %1, %2"
   :"=r" (x), "=r" (y)
   :"r" (s));
```
Seems pretty straightforward, right? Zero out x, and copy s to y. However, because neither output is marked as "early clobber" (see '&' on https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html), gcc assumes all inputs are consumed before any output is written. So it is possible (not guaranteed) that when optimizing, gcc will choose the same register for both %0 and %2 (or maybe %1 and %2). Then when you zero out %0, you could also be zeroing out %2.

This can be fixed by adding ampersands to ensure there's no overlap:
```
:"=&r" (x), "=&r" (y)
```
But how are you expected to know this? And knowing this detail doesn't help you learn assembler. It's just a weird quirk about how gcc's inline asm works. If you were writing an actual asm routine (which is what I'd recommend), you'd never need to know this.

And wouldn't this be easier to read if you used symbolic names?
```
asm volatile (
   "xor %[x], %[x];"
   "mov %[y], %[s]"
   : [x] "=&r" (x), [y] "=&r" (y)
   : [s] "r" (s));
```
I find it easier to read. But this is another thing that isn't really about assembly language. It's just a trick about how to shove inline asm into c code when using gcc (something you should almost never do).

What else? Some other issues with this template: The volatile qualifier doesn't belong here. It's missing the "cc" clobber. And the "memory" clobber. And you end up clobbering more registers than you need. Oh and why not just tell people to compile with -masm=intel and avoid that ".intel_syntax noprefix;" and ".att_syntax prefix;" junk (yet more gcc quirks).

Using assembly language can be useful. I'm not trying to say that it isn't. But trying to use gcc's inline asm is filled with quirks. Since functions written in pure assembler can be called from C code, and since that method has none of these issues, I can only conclude that you are being forced to do this because you were mean to him/her and (s)he hates you.

----------- Edit 2 --------------

Since you have posted working code (assuming you have fixed "arb r"), let me offer mine:
```
#include <stdio.h>

int main(int argc, char *argv[]) 
{
  const char *s = "fqr  b qabxx  xryc pqr"; // Succeeds with 5,11

  int x, y;

  // Assumes s is not NULL.
  // Return y=-1 on not found.

  asm volatile (
  ".intel_syntax noprefix\n\t"

     "lea ebx, [%2-1]\n\t"  // ebx is pointer to next character.
     "mov ecx, %2\n\t"      // Save this since we aren't using earlyclobber...
     "mov %1, -1\n"         // ...so at this point, %2 might be dead.

  // Note that ebx starts at s-1.

  "Phase1:\n\t"
     "inc ebx\n\t"
     "mov al, [ebx]\n\t" // Read next byte.

     "test al, al\n\t" 
     "jz Done\n\t"       // End of string.

     // Check for [ab]
     "cmp al, 'a'\n\t" 
     "je Phase2\n\t"

     "cmp al, 'b'\n\t"
     "jne Phase1\n"

     // Phase 2 - Found [ab], check for [^r]+
  "Phase2:\n\t"
     "mov al, byte ptr [ebx+1]\n\t"

     "test al, al\n\t"
     "jz Done\n\t"     // End of string.

     "cmp al, 'r'\n\t"
     "je Phase1\n\t"   // 'r' right after [ab], so [^r]+ fails; go look for another [ab]

     "mov %0, ebx\n\t"

     // Found [ab] followed by at least one non-r character.  Find r.
     "mov ebx, 1\n"

  "Phase3:\n\t"
     "mov al, [%0 + ebx]\n\t" // Read next byte.
     "inc ebx\n\t"

     "test al, al\n\t" 
     "jz Done\n\t"     // End of string.

     "cmp al, 'r'\n\t"
     "jne Phase3\n\t"

     // Found r.
     "sub %0, ecx\n\t" // Set (x)
     "mov %1, ebx\n"

  "Done:\n"

  ".att_syntax prefix"
  :"=r" (x), "=r" (y)
  :"r" (s)
  :"eax", "ebx", "ecx", "edx"
  );

  printf("%d, %d \n", x, y);
  return 0; 
}
```
It's shorter, and doesn't need as many registers (no edx). While it could be tuned up a bit more, it's a credible solution for a homework problem.

If you were allowed to change the framework, it can be a little bit better:
```
   // Returns y = -1 if no regex match is found.

  __asm__ (
      // ---------------------------------
      // Phase1 - look for [ab]

      "mov %[x], %[s]\n"   // Pointer to next char to read

   "Phase1:\n\t"
      "mov al, [%[x]]\n\t" // Read next byte

      "test al, al\n\t" 
      "jz NotFound\n\t"    // Hit end of string

      "inc %[x]\n\t"

      "cmp al, 'a'\n\t" 
      "je Phase2\n\t"

      "cmp al, 'b'\n\t"
      "jne Phase1\n"

      // ---------------------------------
      // Phase2 - Found [ab], Check for [^r]+
   "Phase2:\n\t"

      // x is pointing to the byte after [ab]
      "mov al, [%[x]]\n\t"  // Read next byte.

      "test al, al\n\t"
      "jz NotFound\n\t"     // Hit end of string

      "cmp al, 'r'\n\t"
      "je Phase1\n\t"  // 'r' right after [ab], so [^r]+ fails; go look for another [ab]

      // ---------------------------------
      // Phase3 - Found [ab] and one non-r character.  Now find r.

      // x was advanced 1 past [ab] in Phase1; back up to the start of the match
      "dec %[x]\n\t"

      // We know there is 1 non-r character after [ab]
      "mov %[y], 1\n"

   "Phase3:\n\t"
      "mov al, [%[x] + %[y]]\n\t" // Read next byte.
      "inc %[y]\n\t"

      "test al, al\n\t" 
      "jz NotFound\n\t"     // End of string.

      "cmp al, 'r'\n\t"
      "jne Phase3\n\t"

      // Found +r.
      "sub %[x], %[s]\n\t"  // Set x to offset
      "jmp Done\n"

   "NotFound:\n\t"
      "mov %[y], -1\n"

   "Done:"

   : [x] "=&r" (x), [y] "=&r" (y)
   : [s] "r" (s)
   : "eax", "cc", "memory");
```
The main changes were:
1. Assumes the code is compiled with -masm=intel.
2. Change from "=r" to "=&r". This guarantees that x, y and s all end up in separate registers.
3. Use symbolic names. Instead of referring to x as %0, we can use the name %[x].
4. Since this code reads memory and modifies flags, I have added the "cc" and "memory" clobbers.
5. Remove unneeded volatile.
This clobbers even fewer registers (only eax). While using registers is not 'bad' (it's hard to do much without them), the more you reserve for use in your asm, the more work the compiler has to do to free up those registers before calling your code. Since x, y and s are already in registers (due to "r"), making use of them simplifies the code.