Regex to match specific functions and their arguments in files

没有蜡笔的小新 2021-01-16 07:33

I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.

I need to catch every argument passed to a specific method call _n( and

6 answers
  •  -上瘾入骨i
    2021-01-16 07:46

    Note: Read this answer if you're not familiar with recursion.

    Part 1: match specific functions

    Who said that regex can't be modular? Well, PCRE regex to the rescue!

    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
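       (?P<str_double_quotes>
          (?<!\\)                     # Not preceded by a backslash
          "                           # Opening double quote
          (?: [^\\"] | \\. )*         # Plain characters or backslash escapes
          "                           # Closing double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)                     # Not preceded by a backslash
          '                           # Opening single quote
          (?: [^\\'] | \\. )*         # Plain characters or backslash escapes
          '                           # Closing single quote
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    _n?                               # Match _ or _n
    \s*                               # Optional white spaces
    (?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group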
    ~sx
    

    The s modifier lets . match newlines as well, and the x modifier allows the free spacing and inline comments used in the regex above.
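
    To see the (?(DEFINE)...) mechanics in isolation, here is a minimal sketch that keeps only the recursive brackets definition. It assumes there are no quoted strings containing brackets, and the sample $input is made up for illustration:

    ```php
    <?php
    // Minimal sketch: only the recursive "brackets" definition is kept,
    // so brackets inside quoted strings are NOT handled here.
    $pattern = '~
        (?(DEFINE)
            (?P<brackets> \( (?: [^()] | (?&brackets) )* \) )
        )
        _n? \s* (?P<results>(?&brackets))
    ~sx';

    // Hypothetical input with nested brackets and a space before the call bracket.
    $input = '_n(a, f(b, (c)), d); _ ("x")';

    preg_match_all($pattern, $input, $m);

    // $m['results'] holds '(a, f(b, (c)), d)' and '("x")'
    print_r($m['results']);
    ```

    Nothing inside the DEFINE group matches on its own; the definitions only take part when they are called with (?&name).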


    Part 2: getting rid of opening & closing brackets

    Since each match also includes the outer opening and closing brackets (), we need to strip them. We will run preg_replace() on the results:

    ~           # Delimiter
    ^           # Assert begin of string
    \(          # Match an opening bracket
    \s*         # Match optional whitespaces
    |           # Or
    \s*         # Match optional whitespaces
    \)          # Match a closing bracket
    $           # Assert end of string
    ~x
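
    A quick sketch of that replacement on a few made-up captures (the $captured values are hypothetical; preg_replace() accepts an array as subject and returns an array):

    ```php
    <?php
    // Hypothetical captures, shaped like the "results" group from Part 1.
    $captured = ['( "foo", bar )', "(\n  1,\n  2\n)", '()'];

    // Same pattern as above, written on one line without the x modifier.
    $stripped = preg_replace('~^\(\s*|\s*\)$~', '', $captured);

    // Only the outermost bracket pair and the whitespace next to it are removed.
    print_r($stripped);
    ```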
    


    Part 3: extracting the arguments

    Here's another modular regex; you could even extend it with your own grammar:

    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
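       (?P<str_double_quotes>
          (?<!\\)                     # Not preceded by a backslash
          "                           # Opening double quote
          (?: [^\\"] | \\. )*         # Plain characters or backslash escapes
          "                           # Closing double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)                     # Not preceded by a backslash
          '                           # Opening single quote
          (?: [^\\'] | \\. )*         # Plain characters or backslash escapes
          '                           # Closing single quote
       )
    
       (?P<array>
          Array\s*
          (?&brackets)
       )
    
       (?P<variable>
          [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                (?&array)             # Recurse/use the array pattern
                |                     # Or
                (?&variable)          # Recurse/use the variable pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    (?&array)
    |
    (?&str_double_quotes)
    |
    (?&str_single_quotes)
    |
    (?&variable)                      # Tried last, otherwise it would swallow the quotes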
    ~xis
    

    We loop over the filtered results and run preg_match_all() on each one. The final code looks like this:

    $functionPattern = <<<'regex'
    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
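       (?P<str_double_quotes>
          (?<!\\)                     # Not preceded by a backslash
          "                           # Opening double quote
          (?: [^\\"] | \\. )*         # Plain characters or backslash escapes
          "                           # Closing double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)                     # Not preceded by a backslash
          '                           # Opening single quote
          (?: [^\\'] | \\. )*         # Plain characters or backslash escapes
          '                           # Closing single quote
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    _n?                               # Match _ or _n
    \s*                               # Optional white spaces
    (?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group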
    ~sx
    regex;
    
    
    $argumentsPattern = <<<'regex'
    ~                      # Delimiter
    (?(DEFINE)             # Start of definitions
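       (?P<str_double_quotes>
          (?<!\\)                     # Not preceded by a backslash
          "                           # Opening double quote
          (?: [^\\"] | \\. )*         # Plain characters or backslash escapes
          "                           # Closing double quote
       )
    
       (?P<str_single_quotes>
          (?<!\\)                     # Not preceded by a backslash
          '                           # Opening single quote
          (?: [^\\'] | \\. )*         # Plain characters or backslash escapes
          '                           # Closing single quote
       )
    
       (?P<array>
          Array\s*
          (?&brackets)
       )
    
       (?P<variable>
          [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
       )
    
       (?P<brackets>
          \(                          # Match an opening bracket
             (?:                      # A non capturing group
                (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
                |                     # Or
                (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
                |                     # Or
                (?&array)             # Recurse/use the array pattern
                |                     # Or
                (?&variable)          # Recurse/use the variable pattern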
                |                     # Or
                [^()]                 # Anything not a bracket
                |                     # Or
                (?&brackets)          # Recurse the bracket pattern
             )*
          \)
       )
    )                                 # End of definitions
    # Let's start matching for real now:
    (?&array)
    |
    (?&str_double_quotes)
    |
    (?&str_single_quotes)
    |
    (?&variable)
    ~six
    regex;
    
    $input = <<<'input'
    _  ("foo") // want "foo"
    _n("bar", "baz", 42); // want "bar", "baz", 42
    _n(domain, "bux", var); // want domain, "bux", var
    _( "one (optional)" ); // want "one (optional)"
    apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
    
    // misleading cases
    _n("foo (")
    _n("foo (\)", 'foo)', aa)
    _n( Array(1, 2, 3), Array(")",   '(')   );
    _n(function(foo){return foo*2;}); // Is this even valid?
    _n   ();   // Empty
    _ (   
        "Foo",
        'Bar',
        Array(
            "wow",
            "much",
            'whitespaces'
        ),
        multiline
    ); // PCRE is awesome
    input;
    
    if(preg_match_all($functionPattern, $input, $m)){
        $filtered = preg_replace(
            '~          # Delimiter
            ^           # Assert begin of string
            \(          # Match an opening bracket
            \s*         # Match optional whitespaces
            |           # Or
            \s*         # Match optional whitespaces
            \)          # Match a closing bracket
            $           # Assert end of string
            ~x', // Regex
            '', // Replace with nothing
            $m['results'] // Subject
        ); // Getting rid of opening & closing brackets
    
        // Part 3: extract arguments:
        $parsedTree = array();
        foreach($filtered as $arguments){   // Loop
            if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
                $parsedTree[] = array(
                    'all_arguments' => $arguments,
                    'branches' => $m[0]
                ); // Add an array to our tree and fill it
            }else{
                $parsedTree[] = array(
                    'all_arguments' => $arguments,
                    'branches' => array()
                ); // Add an array with empty branches
            }
        }
    
        print_r($parsedTree); // Let's see the results;
    }else{
        echo 'no matches';
    }
    


    You might want to create a recursive function to generate a full tree. See this answer.
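
    A rough sketch of such a recursive function, under a few assumptions: parseArguments() and ARG_PATTERN are made-up names, the pattern is a simplified stand-in for the arguments pattern above (quoted strings are matched without escape handling), and only Array(...) arguments are expanded into sub-branches:

    ```php
    <?php
    // Simplified, hypothetical argument pattern: quoted strings without
    // escape handling, Array(...) with nested brackets, and bare tokens.
    const ARG_PATTERN = '~
        (?(DEFINE)
            (?P<str>      "[^"]*" | \'[^\']*\' )
            (?P<brackets> \( (?: (?&str) | [^()] | (?&brackets) )* \) )
            (?P<array>    Array \s* (?&brackets) )
        )
        (?&array) | (?&str) | [^\s,()]+
    ~xi';

    function parseArguments(string $arguments): array
    {
        preg_match_all(ARG_PATTERN, $arguments, $m);

        $tree = [];
        foreach ($m[0] as $argument) {
            if (preg_match('~^Array\s*\((.*)\)$~is', $argument, $inner)) {
                // An array argument: parse its body with the same function.
                $tree[] = ['array' => parseArguments($inner[1])];
            } else {
                // A string or a bare token: keep it as a leaf.
                $tree[] = $argument;
            }
        }

        return $tree;
    }

    $tree = parseArguments('"a", Array(1, Array("x)", y)), z');
    print_r($tree);
    ```

    Each Array(...) becomes a nested 'array' branch, so the depth of the resulting tree follows the nesting of the input.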

    You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
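
    One possible starting point for that exercise, as a sketch only: the braces and function definitions below are hypothetical additions, they are not part of the patterns above, and strings or comments inside the function body are not handled:

    ```php
    <?php
    // Hypothetical extension: a "function" is the keyword, an optional name,
    // balanced round brackets, then balanced curly braces.
    $pattern = '~
        (?(DEFINE)
            (?P<brackets> \( (?: [^()] | (?&brackets) )* \) )
            (?P<braces>   \{ (?: [^{}] | (?&braces)   )* \} )
            (?P<function> function \s* \w* \s* (?&brackets) \s* (?&braces) )
        )
        (?&function) | [^\s,()]+
    ~x';

    // The function expression is now returned as a single argument.
    preg_match_all($pattern, 'function(foo){return foo*2;}, bar', $m);
    print_r($m[0]);
    ```

    In the full arguments pattern the (?&function) alternative would have to come before (?&variable), for the same reason the string patterns do.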
