Regex to match specific functions and their arguments in files

没有蜡笔的小新 2021-01-16 07:33

I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.

I need to catch every argument passed to a specific method call _n( and

6 Answers
  • 2021-01-16 07:46

    Note: Read this answer if you're not familiar with recursion.

    Part 1: match specific functions

    Who said that regex can't be modular? Well PCRE regex to the rescue!

    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
       (?P<str_double_quotes>
          (?<!\\)          # Not escaped
          "                # Match a double quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          "                # Match the ending double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)          # Not escaped
          '                # Match a single quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          '                # Match the ending single quote
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    _n?                               # Match _ or _n
    \s*                               # Optional white spaces
    (?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
    ~sx
    

    The s modifier lets . match newlines, and the x modifier allows the free spacing and inline comments used in this regex.

    Online regex demo Online php demo

    Part 2: getting rid of opening & closing brackets

    Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:

    ~           # Delimiter
    ^           # Assert begin of string
    \(          # Match an opening bracket
    \s*         # Match optional whitespaces
    |           # Or
    \s*         # Match optional whitespaces
    \)          # Match a closing bracket
    $           # Assert end of string
    ~x
    

    Online php demo
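For readers who are not working in PHP, here is a minimal sketch of the same bracket-stripping step using Python's `re` module. The names `OUTER_BRACKETS` and `strip_outer_brackets` are made up for this illustration; the pattern itself is the one shown above.

```python
import re

# Same alternation as above: a leading "(" plus spaces, or spaces plus a trailing ")".
OUTER_BRACKETS = re.compile(
    r"""
    ^\(\s*      # opening bracket at the start, then optional whitespace
    |           # or
    \s*\)$      # optional whitespace, then a closing bracket at the end
    """,
    re.VERBOSE,
)


def strip_outer_brackets(captured: str) -> str:
    """Remove the outermost brackets from one captured argument list."""
    return OUTER_BRACKETS.sub("", captured)


print(strip_outer_brackets('( "one (optional)" )'))
```

Brackets inside the string literal are left alone because both alternatives are anchored to the start or end of the captured text.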

    Part 3: extracting the arguments

    So here's another modular regex; you could even add your own grammar:

    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
       (?P<str_double_quotes>
          (?<!\\)          # Not escaped
          "                # Match a double quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          "                # Match the ending double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)          # Not escaped
          '                # Match a single quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          '                # Match the ending single quote
       )
    
       (?P<array>
          Array\s*
          (?&brackets)
       )
    
       (?P<variable>
          [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                (?&array)             # Recurse/use the array pattern
                |                     # Or
                (?&variable)          # Recurse/use the variable pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    (?&array)
    |
    (?&str_double_quotes)
    |
    (?&str_single_quotes)
    |
    (?&variable)           # Last, because it would otherwise swallow the start of a quoted string
    ~xis
    

    We will loop and use preg_match_all(). The final code would look like this:

    $functionPattern = <<<'regex'
    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
       (?P<str_double_quotes>
          (?<!\\)          # Not escaped
          "                # Match a double quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          "                # Match the ending double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)          # Not escaped
          '                # Match a single quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          '                # Match the ending single quote
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    _n?                               # Match _ or _n
    \s*                               # Optional white spaces
    (?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
    ~sx
    regex;
    
    
    $argumentsPattern = <<<'regex'
    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
       (?P<str_double_quotes>
          (?<!\\)          # Not escaped
          "                # Match a double quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          "                # Match the ending double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)          # Not escaped
          '                # Match a single quote
          (?:              # Non-capturing group
             [^\\]         # Match anything not a backslash
             |             # Or
             \\.           # Match a backslash and a single character (ie: an escaped character)
          )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
          '                # Match the ending single quote
       )
    
       (?P<array>
          Array\s*
          (?&brackets)
       )
    
       (?P<variable>
          [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                (?&array)             # Recurse/use the array pattern
                |                     # Or
                (?&variable)          # Recurse/use the variable pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    (?&array)
    |
    (?&str_double_quotes)
    |
    (?&str_single_quotes)
    |
    (?&variable)
    ~six
    regex;
    
    $input = <<<'input'
    _  ("foo") // want "foo"
    _n("bar", "baz", 42); // want "bar", "baz", 42
    _n(domain, "bux", var); // want domain, "bux", var
    _( "one (optional)" ); // want "one (optional)"
    apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
    
    // misleading cases
    _n("foo (")
    _n("foo (\)", 'foo)', aa)
    _n( Array(1, 2, 3), Array(")",   '(')   );
    _n(function(foo){return foo*2;}); // Is this even valid?
    _n   ();   // Empty
    _ (   
        "Foo",
        'Bar',
        Array(
            "wow",
            "much",
            'whitespaces'
        ),
        multiline
    ); // PCRE is awesome
    input;
    
    if(preg_match_all($functionPattern, $input, $m)){
        $filtered = preg_replace(
            '~          # Delimiter
            ^           # Assert begin of string
            \(          # Match an opening bracket
            \s*         # Match optional whitespaces
            |           # Or
            \s*         # Match optional whitespaces
            \)          # Match a closing bracket
            $           # Assert end of string
            ~x', // Regex
            '', // Replace with nothing
            $m['results'] // Subject
        ); // Getting rid of opening & closing brackets
    
        // Part 3: extract arguments:
        $parsedTree = array();
        foreach($filtered as $arguments){   // Loop
            if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
                $parsedTree[] = array(
                    'all_arguments' => $arguments,
                    'branches' => $m[0]
                ); // Add an array to our tree and fill it
            }else{
                $parsedTree[] = array(
                    'all_arguments' => $arguments,
                    'branches' => array()
                ); // Add an array with empty branches
            }
        }
    
        print_r($parsedTree); // Let's see the results;
    }else{
        echo 'no matches';
    }
    

    Online php demo

    You might want to create a recursive function to generate a full tree. See this answer.

    You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
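As a rough, non-regex alternative for the argument-extraction step, the sketch below splits an already-stripped argument list at top-level commas. It is written in Python rather than PHP, `split_arguments` is a hypothetical helper name, and it assumes that only quotes and `()`, `[]`, `{}` need tracking (comments and regex literals are ignored).

```python
def split_arguments(arguments: str) -> list[str]:
    """Split an argument list at commas that are outside quotes and brackets."""
    parts, current = [], []
    depth = 0       # nesting depth of (), [] and {}
    quote = None    # the quote character we are currently inside, if any
    i = 0
    while i < len(arguments):
        ch = arguments[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(arguments):
                i += 1
                current.append(arguments[i])  # keep the escaped character
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch in "([{":
            depth += 1
            current.append(ch)
        elif ch in ")]}":
            depth -= 1
            current.append(ch)
        elif ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    last = "".join(current).strip()
    if last or parts:
        parts.append(last)
    return parts


print(split_arguments('Array(1, 2, 3), Array(")",   \'(\')'))
print(split_arguments('function(foo){return foo*2;}'))
```

Because nesting is counted rather than pattern-matched, the `function(foo){...}` argument stays in one piece.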

  • 2021-01-16 07:53

    One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):

    $string = '_("foo")
    _n("bar", "baz", 42); 
    _n(domain, "bux", var);
    _( "one (optional)" );
    apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
    
    preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);
    
    foreach($matches[0] as $test){
        $opArr = explode(',', $test);
        foreach($opArr as $test2){
           echo trim($test2) . "\n";
           }
        }
    

    You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1

    Output is:

    "foo"
    "bar"
    "baz"
    42
    domain
    "bux"
    var
    "one (optional)"
    "No apples"
    "%1 apple"
    "%1 apples"
    apples
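One caveat with splitting on every comma is that a comma inside a string literal is treated as an argument separator. The Python sketch below contrasts a plain split with a quote-aware tokenizer; the `ARGUMENT` pattern and the sample text are assumptions made for this illustration, not part of the answer's code.

```python
import re

# One token is a quoted string (with backslash escapes) or a run of
# characters that contains neither whitespace nor a comma.
ARGUMENT = re.compile(
    r'''
    "(?:[^"\\]|\\.)*"      # double-quoted string, escapes allowed
    | '(?:[^'\\]|\\.)*'    # single-quoted string, escapes allowed
    | [^\s,]+              # anything else up to whitespace or a comma
    ''',
    re.VERBOSE,
)

captured = '"Hello, world", "%1 apples", apples'

naive = [piece.strip() for piece in captured.split(",")]
aware = ARGUMENT.findall(captured)

print(naive)  # the first string literal is cut in two
print(aware)  # each argument stays whole
```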
    
  • 2021-01-16 07:56

    Try this:

    (?<=\().*?(?=\s*\)[^)]*$)
    

    See live demo

  • 2021-01-16 07:57

    Below regex should help you.

    ^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);
    

    Check the demo here

  • 2021-01-16 08:02

    \(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)

    This should get anything between a pair of parentheses, ignoring parentheses inside quotes. Explanation:

    \( // Literal open paren
        (
             | //Space or
            "(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
            '(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
            [^)"'] //Any character that isn't a quote or close paren
        )*? // All that, as many times as necessary
    \) // Literal close paren
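To see how this pattern behaves, here is a small Python check. The helper name `first_parenthesized` is hypothetical; the pattern is copied unchanged from above. Note that it stops at the first unquoted closing paren, so a nested call is cut short.

```python
import re

# The pattern from the answer, unchanged.
PARENS = re.compile(r'''\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)''')


def first_parenthesized(text):
    """Return the first parenthesized span found by the pattern, or None."""
    match = PARENS.search(text)
    return match.group(0) if match else None


print(first_parenthesized('_( "one (optional)" );'))  # parens inside quotes are skipped
print(first_parenthesized('_n("foo (")'))             # unbalanced paren inside quotes is fine
print(first_parenthesized('_n(Array(1, 2), x)'))      # nested call: stops at the first ")"
```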
    

    No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?

    // This is just a Java-like sketch. A loop like this can be more readable, maintainable, and predictable than a regular expression.
    // `input` is the source text as a String; capture() stands for whatever you do with each result.
    for (int i = 0; i < input.length(); i++) {
        // Ignore anything that isn't an opening paren
        if (input.charAt(i) != '(') continue;
        StringBuilder capturedText = new StringBuilder();
        // Loop until a close paren or the end of the input is reached (bounds are checked before indexing)
        for (i++; i < input.length() && input.charAt(i) != ')'; i++) {
            char c = input.charAt(i);
            capturedText.append(c);
            if (c == '"' || c == '\'') {
                // Copy the string literal up to its unescaped closing quote, or the end of the input
                for (i++; i < input.length(); i++) {
                    char s = input.charAt(i);
                    capturedText.append(s);
                    if (s == '\\' && i + 1 < input.length()) {
                        capturedText.append(input.charAt(++i)); // keep the escaped character
                    } else if (s == c) {
                        break; // closing quote found
                    }
                }
            }
        }
        capture(capturedText.toString()); // text between the parens, nested parens not handled
    }
    

    Note: I didn't cover how to determine whether a paren belongs to a function call or is just a grouping symbol (i.e., this will also match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own JavaScript parser. You might want to take a look at the source code of actual JavaScript parsers if you need that sort of accuracy.
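The loop idea can be pushed a little further. The Python sketch below only reports parens that directly follow a whole identifier named `_` or `_n`, and it counts nesting depth. The names `find_calls`, `_skip_string`, and `_matching_paren` are hypothetical, and JavaScript comments, template strings, and regex literals are deliberately not handled.

```python
def _skip_string(source, i):
    """Given the index of an opening quote, return the index just past the closing quote."""
    quote = source[i]
    i += 1
    while i < len(source):
        if source[i] == "\\":
            i += 2                      # skip the escaped character
        elif source[i] == quote:
            return i + 1
        else:
            i += 1
    return len(source)                  # unterminated string: stop at the end


def _matching_paren(source, open_index):
    """Return the index of the ')' closing source[open_index], or -1 if there is none."""
    depth = 0
    i = open_index
    while i < len(source):
        ch = source[i]
        if ch in "\"'":
            i = _skip_string(source, i)
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1


def find_calls(source, names=("_", "_n")):
    """Return the raw text between the brackets of each call to one of `names`."""
    results = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch in "\"'":                 # string literal outside any call
            i = _skip_string(source, i)
        elif ch.isalnum() or ch in "_$":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] in "_$"):
                i += 1
            j = i
            while j < len(source) and source[j].isspace():
                j += 1                  # allow whitespace before "("
            if source[start:i] in names and j < len(source) and source[j] == "(":
                end = _matching_paren(source, j)
                if end != -1:
                    results.append(source[j + 1:end].strip())
                    i = end + 1
        else:
            i += 1
    return results


print(find_calls('a = (b * c); _n("foo (\\)", \'foo)\', aa); my_n("x")'))
```

Reading the whole identifier before comparing it is what keeps `my_n("x")` and the grouping parens in `a = (b * c)` out of the results.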

  • 2021-01-16 08:06

    We can do this in two steps:

    1) Catch all function arguments for _n( or _( method calls

    (?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)
    

    See demo.

    http://regex101.com/r/oE6jJ1/13

    2) Catch the stringy ones only

    "([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))
    

    See demo.

    http://regex101.com/r/oE6jJ1/14
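A small Python sketch of chaining the two steps, using both patterns unchanged. The names `CALL`, `ARG`, and `extract` are hypothetical. It assumes at most one level of nested brackets, which is all the first pattern allows.

```python
import re

# Step 1: a _( or _n( call with at most one level of nested brackets.
CALL = re.compile(r'(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)')
# Step 2: either the inside of a double-quoted string, or a bare token
# that follows "(" or "," and runs up to the next "," or ")".
ARG = re.compile(r'"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))')


def extract(source):
    """Return one list of arguments per matched _( or _n( call."""
    calls = []
    for call in CALL.findall(source):
        # Each findall() row is (quoted_text, bare_text); at most one is non-empty.
        calls.append([quoted or bare for quoted, bare in ARG.findall(call)])
    return calls


print(extract('_n(domain, "bux", var);'))
print(extract('apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)'))
```

The second pattern drops the surrounding double quotes, so the caller can no longer tell the string "bux" from a variable named bux unless it checks which capture group matched.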
