Implementing a matcher for the regex '[ab][^r]+r' in assembly

名媛妹妹 2021-01-21 20:36

I need help with my assembly code. I need to write code that finds the range of the input that matches my regular expression.

My regex: [ab][^r]+r, so first I'm loo

2 answers
  •  小蘑菇 (original poster)
    2021-01-21 20:52

    First of all, your requirements are a bit unclear. When I first read your post, it looked like you were trying to write a complete "regex" routine in assembler. But looking closer, it appears that all you are really doing is hardcoding this one, very specific regex search in assembler. This misunderstanding is probably why this question isn't getting any responses.

    Second, you should talk to this guy who is apparently in the same class as you. Perhaps you two can share notes.

    Third, someone should talk to your instructor about his assignments. Using gcc's "inline asm" to teach asm is probably the hardest possible approach. Does he hate his students? And I'm not impressed with the "template" that he provides which (apparently?) you are not permitted to change. I can see at least a half dozen things I'd change here.

    Fourth, you say that the regex string "[ab][^r]+r" should print out 5, 10 for "fqr  b qabxx  xryc pqr". I'm not sure how that can be. The match does start at (zero-based) 5, but it does not end at position 10:

              1         2
    0123456789012345678901
    fqr  b qabxx  xryc pqr
         ^         ^
       start      end
    

    The end is position 15. The matching string ("b qabxx  xr", with two spaces before the final "xr") is 11 characters long, so apparently you aren't looking for the length. There is a second 'start' that occurs at position 8, a third at position 9, and there are multiple possible endpoints as well. None of which explains where you are supposed to get "10". I'm going to assume you meant 5, 15.
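    To make the "several possible starts" point concrete, here is a minimal C sketch. The helper name candidate_starts is made up for illustration and is not part of the assignment. It lists every zero-based index where a match of [ab][^r]+r could begin: an 'a' or 'b', followed by at least one non-'r' character, with an 'r' somewhere after that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): stores in out[] every
   zero-based index at which a match of [ab][^r]+r could begin. Returns the
   number of such indexes; at most cap of them are stored. */
static size_t candidate_starts(const char *s, size_t *out, size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] != 'a' && s[i] != 'b')
            continue;                 /* [ab] */
        if (s[i + 1] == '\0' || s[i + 1] == 'r')
            continue;                 /* [^r]+ needs one non-'r' character */
        if (strchr(s + i + 2, 'r') == NULL)
            continue;                 /* a closing 'r' must follow */
        if (n < cap)
            out[n] = i;
        n++;
    }
    return n;
}
```

    For "fqr  b qabxx  xryc pqr" this reports 5, 8 and 9. A regex search normally returns the leftmost of these, which is why 5 is the expected start.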

    All that said, processing [ab][^r]+r breaks down into essentially 3 parts:

    1. [ab] Find either 'a' or 'b'. Exit with error upon encountering end of string if you can't find them.
    2. [^r]+ If the letter immediately after (1) is 'r', goto 1.
    3. r Walk the rest of the string and exit with success on next 'r', or exit with error at end of string.
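    The three parts above can be sketched in plain C before translating them to assembler. This is only a reference sketch under my own assumptions: the function name and the (start, length) output convention are invented here, and a NUL-terminated, non-NULL string is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (names are illustrative only).
   Returns 1 and stores the zero-based offset of the [ab] character in
   *start and the length of the match in *len; returns 0 if nothing matches. */
static int match_ab_notr_r(const char *s, size_t *start, size_t *len)
{
    assert(s != NULL && start != NULL && len != NULL);

    for (const char *p = s; *p != '\0'; p++) {
        /* Part 1: [ab] */
        if (*p != 'a' && *p != 'b')
            continue;

        /* Part 2: [^r]+ needs at least one non-'r' right after [ab]. */
        if (p[1] == '\0')
            return 0;             /* end of string, nothing can follow */
        if (p[1] == 'r')
            continue;             /* back to part 1 */

        /* Part 3: walk to the next 'r'. */
        const char *q = p + 2;
        while (*q != '\0' && *q != 'r')
            q++;
        if (*q == '\0')
            return 0;             /* no 'r' left, so no later start can match */

        *start = (size_t)(p - s);
        *len = (size_t)(q - p) + 1;
        return 1;
    }
    return 0;
}
```

    With the sample string this yields start 5 and length 11, that is, a match ending at index 15. For "arb r" it skips the 'a' (followed by 'r') and matches "b r" at offset 2.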

    If you don't understand why these are the parts, try playing with https://regex101.com/r/E3nI1F/1 (it lets you try various regex search strings to see what it finds).
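    The same experiment can be run locally. The sketch below assumes a system that provides the POSIX <regex.h> functions regcomp, regexec and regfree (glibc and most Unix-like systems do). The wrapper posix_find is a hypothetical name, and its inclusive end index is my own convention.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches s with the POSIX extended regex [ab][^r]+r.
   Returns 1 and fills *start and *end (end is inclusive) on a match,
   0 when nothing matches, -1 if the pattern does not compile. */
static int posix_find(const char *s, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];

    assert(s != NULL && start != NULL && end != NULL);

    if (regcomp(&re, "[ab][^r]+r", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, s, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo - 1;   /* rm_eo is one past the last matched byte */
    return 1;
}
```

    For the sample string this should report start 5 and end 15, which gives a handy oracle to compare the assembler routine against.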

    Looking at your current code, I don't think you handle either (2) or (3) correctly (actually, I don't think you handle them at all). While there are other things I would change in your code, perhaps tuning should wait until the thing works correctly.

    Given that this is homework, I'm not keen on just posting my code. If you just copy/paste my work, you aren't going to learn anything.

    If you want to edit your question once you have added the work for 2 and 3, I could review this again or give more suggestions. If you post a working copy, I can share mine and we can compare them.

    ----------- Edit 1 --------------

    "my teacher does not seem to hate us"

    Oh? Consider this code (a simplified version of yours):

    asm volatile (
       "xor %0, %0;"
       "mov %1, %2"
       :"=r" (x), "=r" (y)
       :"r" (s));
    

    Seems pretty straightforward, right? Zero out x, and copy s to y. However, because the outputs are not marked "early clobber" (see '&' on https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html), it is possible (not guaranteed) that when optimizing, gcc will choose the same register for both %0 and %2 (or maybe %1 and %2). So when you zero out %0, you could also be zeroing out %2.

    This can be fixed by adding ampersands to ensure there's no overlap:

    :"=&r" (x), "=&r" (y)
    

    But how are you expected to know this? And knowing this detail doesn't help you learn assembler. It's just a weird quirk about how gcc's inline asm works. If you were writing an actual asm routine (which is what I'd recommend), you'd never need to know this.

    And wouldn't this be easier to read if you used symbolic names?

    asm volatile (
       "xor %[x], %[x];"
       "mov %[y], %[s]"
       : [x] "=&r" (x), [y] "=&r" (y)
       : [s] "r" (s));
    

    I find it easier to read. But this is another thing that isn't really about assembly language. It's just a trick about how to shove inline asm into C code when using gcc (something you should almost never do).
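    For reference, here is the early-clobber and symbolic-name version wrapped in a complete function. It is a sketch under stated assumptions: the function name zero_and_copy is invented, the asm path is only used with a GNU-compatible compiler on x86, and the {att|intel} braces are gcc's dialect alternatives, used so the operand order is right with or without -masm=intel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper around the simplified snippet: sets *x to 0 and
   copies s into *y. Falls back to plain C on non-x86 or non-GNU compilers. */
static void zero_and_copy(long s, long *x, long *y)
{
    assert(x != NULL && y != NULL);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    long xv, yv;
    __asm__ (
        "xor %[x], %[x]\n\t"
        /* {att|intel}: source and destination swap between the dialects */
        "mov {%[s], %[y]|%[y], %[s]}"
        : [x] "=&r" (xv), [y] "=&r" (yv)  /* '&' keeps outputs out of s's register */
        : [s] "r" (s)
        : "cc");                           /* xor modifies the flags */
    *x = xv;
    *y = yv;
#else
    *x = 0;
    *y = s;
#endif
}
```

    Strictly speaking only x needs the '&' here, since it is written before s is read. Marking y as well, as in the snippet above, is harmless.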

    What else? Some other issues with this template: The volatile qualifier doesn't belong here. It's missing the "cc" clobber. And the "memory" clobber. And you end up clobbering more registers than you need. Oh and why not just tell people to compile with -masm=intel and avoid that ".intel_syntax noprefix;" and ".att_syntax prefix;" junk (yet more gcc quirks).

    Using assembly language can be useful. I'm not trying to say that it isn't. But trying to use gcc's inline asm is filled with quirks. Since functions written in pure assembler can be called from C code, and since that method has none of these issues, I can only conclude that you are being forced to do this because you were mean to him/her and (s)he hates you.

    ----------- Edit 2 --------------

    Since you have posted working code (assuming you have fixed "arb r"), let me offer mine:

    #include <stdio.h>
    
    int main(int argc, char *argv[]) 
    {
      const char *s = "fqr  b qabxx  xryc pqr"; // Succeeds with 5,11
    
      int x, y;
    
      // Assumes s is not NULL.
      // Return y=-1 on not found.
    
      asm volatile (
      ".intel_syntax noprefix\n\t"
    
         "lea ebx, [%2-1]\n\t"  // ebx is pointer to next character.
         "mov ecx, %2\n\t"      // Save this since we aren't using earlyclobber...
         "mov %1, -1\n"         // ...so at this point, %2 might be dead.
    
      // Note that ebx starts at s-1.
    
      "Phase1:\n\t"
         "inc ebx\n\t"
         "mov al, [ebx]\n\t" // Read next byte.
    
         "test al, al\n\t" 
         "jz Done\n\t"       // End of string.
    
         // Check for [ab]
         "cmp al, 'a'\n\t" 
         "je Phase2\n\t"
    
         "cmp al, 'b'\n\t"
         "jne Phase1\n"
    
         // Phase 2 - Found [ab], check for [^r]+
      "Phase2:\n\t"
         "mov al, byte ptr [ebx+1]\n\t"
    
         "test al, al\n\t" 
         "jz Done\n\t"     // End of string.
    
         "cmp al, 'r'\n\t"
         "je Phase1\n\t"   // Found 'r', so [^r]+ fails; go look for another [ab]
    
         "mov %0, ebx\n\t"
    
         // Found [ab] followed by at least one non-'r'.  Find r.
         "mov ebx, 1\n"
    
      "Phase3:\n\t"
         "mov al, [%0 + ebx]\n\t" // Read next byte.
         "inc ebx\n\t"
    
         "test al, al\n\t" 
         "jz Done\n\t"     // End of string.
    
         "cmp al, 'r'\n\t"
         "jne Phase3\n\t"
    
         // Found r.
         "sub %0, ecx\n\t" // Set (x)
         "mov %1, ebx\n"
    
      "Done:\n"
    
      ".att_syntax prefix"
      :"=r" (x), "=r" (y)
      :"r" (s)
      :"eax", "ebx", "ecx", "edx"
      );
    
      printf("%d, %d \n", x, y);
      return 0; 
    }
    

    It's shorter, and doesn't need as many registers (no edx). While it could be tuned up a bit more, it's a credible solution for a homework problem.

    If you were allowed to change the framework, it can be a little bit better:

       // Returns y = -1 if no regex match is found.
    
      __asm__ (
          // ---------------------------------
          // Phase1 - look for [ab]
    
          "mov %[x], %[s]\n"   // Pointer to next char to read
    
       "Phase1:\n\t"
          "mov al, [%[x]]\n\t" // Read next byte
    
          "test al, al\n\t" 
          "jz NotFound\n\t"    // Hit end of string
    
          "inc %[x]\n\t"
    
          "cmp al, 'a'\n\t" 
          "je Phase2\n\t"
    
          "cmp al, 'b'\n\t"
          "jne Phase1\n"
    
          // ---------------------------------
          // Phase2 - Found [ab], Check for [^r]+
       "Phase2:\n\t"
    
          // x is pointing to the byte after [ab]
          "mov al, [%[x]]\n\t"  // Read next byte.
    
          "test al, al\n\t" 
          "jz NotFound\n\t"     // Hit end of string
    
          "cmp al, 'r'\n\t"
          "je Phase1\n\t"  // Found 'r', so [^r]+ fails; go look for another [ab]
    
          // ---------------------------------
          // Phase3 - Found [ab] followed by at least one non-'r'.  Now find r.
    
          // x was advanced one past [ab] in Phase1; step back so it points at [ab]
          "dec %[x]\n\t"
    
          // We know there is 1 non-r character after [ab]
          "mov %[y], 1\n"
    
       "Phase3:\n\t"
          "mov al, [%[x] + %[y]]\n\t" // Read next byte.
          "inc %[y]\n\t"
    
          "test al, al\n\t" 
          "jz NotFound\n\t"     // End of string.
    
          "cmp al, 'r'\n\t"
          "jne Phase3\n\t"
    
          // Found +r.
          "sub %[x], %[s]\n\t"  // Set x to offset
          "jmp Done\n"
    
       "NotFound:\n\t"
          "mov %[y], -1\n"
    
       "Done:"
    
       : [x] "=&r" (x), [y] "=&r" (y)
       : [s] "r" (s)
       : "eax", "cc", "memory");
    

    The main changes were:

    1. Assumes the code is compiled with -masm=intel.
    2. Change from "=r" to "=&r". This guarantees that x, y and s all end up in separate registers.
    3. Use symbolic names. Instead of referring to x as %0, we can use the name %[x].
    4. Since this code reads memory and modifies flags, I have added the "cc" and "memory" clobbers.
    5. Remove unneeded volatile.

    This clobbers even fewer registers (only eax). While using registers is not 'bad' (it's hard to do much without them), the more you reserve for use in your asm, the more work the compiler has to do to free up those registers before calling your code. Since x, y and s are already in registers (due to "r"), making use of them simplifies the code.
