I'm parsing a simple language (Excel formulas) for the functions contained within. A function name must start with any letter, followed by any number of letters/numbers, and e
This is well within the capabilities of .NET regexes. Here's a working demo:
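
    using System;
    using System.Text.RegularExpressions;

    namespace Test
    {
        class Test
        {
            public static void Main()
            {
                // DEPTH is the only group name the pattern itself relies on;
                // "name" and "body" are placeholder names for the two outer groups.
                Regex r = new Regex(@"
                    (?<name>[a-z][a-z0-9]*\()   # function name plus its opening paren
                    (?<body>
                      (?>
                          \((?<DEPTH>)           # nested open paren: push
                        |
                          \)(?<-DEPTH>)          # close paren: pop
                        |
                          [^()]+                 # anything that is not a paren
                      )*
                      (?(DEPTH)(?!))             # fail if an open paren is still unclosed
                    )
                    \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

                string formula = @"=Date(Year(A$5),Month(A$5),1)-(Weekday(Date(Year((A$5+1)),Month(A$5),1))-1)+{0;1;2;3;4;5}*7+{1,2,3,4,5,6,7}-1";

                foreach (Match m in r.Matches(formula))
                {
                    Console.WriteLine("{0}\n", m.Value);
                }
            }
        }
    }
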
Output:

    Date(Year(A$5),Month(A$5),1)

    Weekday(Date(Year((A$5+1)),Month(A$5),1))

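Note that Matches only reports the outermost calls, because the nested ones are consumed as part of each match. If you need the inner names too, one possible approach is to re-run the pattern on the body of each match. This is only a sketch under a few assumptions: the class and method names are made up for illustration, the groups are named name and body, and the opening paren is moved outside the name group so the captured name comes out clean.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class NestedFunctions
{
    // Same balancing-group structure as the demo; group names are placeholders.
    static readonly Regex Call = new Regex(@"
        (?<name>[a-z][a-z0-9]*)\(
        (?<body>
          (?>
              \((?<DEPTH>)
            |
              \)(?<-DEPTH>)
            |
              [^()]+
          )*
          (?(DEPTH)(?!))
        )
        \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns function names in the order they are encountered, outer call first.
    public static List<string> FunctionNames(string text)
    {
        var names = new List<string>();
        foreach (Match m in Call.Matches(text))
        {
            names.Add(m.Groups["name"].Value);
            // Recurse into the argument list to pick up nested calls.
            names.AddRange(FunctionNames(m.Groups["body"].Value));
        }
        return names;
    }

    public static void Main()
    {
        string formula = @"=Date(Year(A$5),Month(A$5),1)-(Weekday(Date(Year((A$5+1)),Month(A$5),1))-1)";
        Console.WriteLine(string.Join(", ", FunctionNames(formula)));
        // prints: Date, Year, Month, Weekday, Date, Year, Month
    }
}
```

Recursing on the body keeps the regex itself unchanged; the trade-off is that each nesting level rescans its own substring.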
The main problem with your regex was that you were including the function name as part of the recursive match, for example:

    Name1(...Name2(...)...)

Any open paren that wasn't preceded by a name was not counted, because it was matched by the final alternative (|.?), and that threw off the balance with the close parens. That also meant you couldn't match formulas like =MyFunc((1+1)), which you mentioned in the text but didn't include in the example. (I threw in an extra set of parens to demonstrate.)
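To see the paren counting in isolation, here is a small sketch that runs the pattern against =MyFunc((1+1)) and against a variant with one close paren missing. The class name, the FirstCall helper, and the group names name and body are assumptions made for this example; only DEPTH matters to the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

class BalanceDemo
{
    static readonly Regex Call = new Regex(@"
        (?<name>[a-z][a-z0-9]*\()
        (?<body>
          (?>
              \((?<DEPTH>)      # every bare open paren is pushed
            |
              \)(?<-DEPTH>)     # every close paren pops one
            |
              [^()]+
          )*
          (?(DEPTH)(?!))        # anything left on the stack means unbalanced
        )
        \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns the first balanced call, or null when there is none.
    public static string FirstCall(string formula)
    {
        Match m = Call.Match(formula);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstCall("=MyFunc((1+1))") ?? "(no match)"); // MyFunc((1+1))
        Console.WriteLine(FirstCall("=MyFunc((1+1)") ?? "(no match)");  // (no match)
    }
}
```

Because the open paren is counted whether or not a name precedes it, the extra set of parens no longer upsets the balance.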
EDIT: Here's the version with support for non-significant, quoted parens:
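
    // DEPTH is required by the pattern; "name" and "body" are placeholder group names.
    Regex r = new Regex(@"
        (?<name>[a-z][a-z0-9]*\()
        (?<body>
          (?>
              \((?<DEPTH>)
            |
              \)(?<-DEPTH>)
            |
              ""[^""]+""        # quoted string: parens inside are not counted
            |
              [^()""]+
          )*
          (?(DEPTH)(?!))
        )
        \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);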
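
As a quick check of the quoted-paren version, the sketch below feeds it a formula whose string literals contain unbalanced parens. The formula, the class name, the FirstCall helper, and the group names name and body are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class QuotedParensDemo
{
    static readonly Regex Call = new Regex(@"
        (?<name>[a-z][a-z0-9]*\()
        (?<body>
          (?>
              \((?<DEPTH>)
            |
              \)(?<-DEPTH>)
            |
              ""[^""]+""        # quoted string, skipped as a unit
            |
              [^()""]+
          )*
          (?(DEPTH)(?!))
        )
        \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns the first balanced call, or null when there is none.
    public static string FirstCall(string formula)
    {
        Match m = Call.Match(formula);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // The literal formula is: =Concat("(", Text(A1,"0)"), ")")
        string formula = @"=Concat(""("", Text(A1,""0)""), "")"")";
        Console.WriteLine(FirstCall(formula));
        // prints: Concat("(", Text(A1,"0)"), ")")
    }
}
```

One thing to watch: the quoted alternative requires at least one character between the quotes, so an empty string literal in a formula will not be skipped by it. Changing the + to * inside that alternative looks like the obvious tweak, though I haven't checked it against every Excel quoting case.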