Before passing a string to eval() I would like to make sure the syntax is correct and allow:
Yes, you need the tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called a "lexer") can only read and classify the elements of an expression; it has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called a parser, which is able to arrange the tokens into a kind of tree (called an "AST") or produce an error message when it fails to do so. Ironically, once you've got a tree, "eval" is not needed anymore; you can evaluate your expression directly from the tree.
I would recommend writing a parser by hand because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
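For illustration, a minimal sketch of such a recursive descent parser/evaluator for plain arithmetic with parentheses could look like the following. The class and method names are made up, there is no unary minus, and it leaves out the a()/b() calls from the question:

class MiniParser
{
    // Grammar handled by this sketch:
    //   expr   := term   (('+'|'-') term)*
    //   term   := factor (('*'|'/') factor)*
    //   factor := NUMBER | '(' expr ')'
    private $tokens;
    private $pos = 0;

    public function __construct($input)
    {
        // Split the input into numbers, operators, parentheses and whitespace;
        // anything else makes the re-assembled matches differ from the input.
        preg_match_all('~\d+(?:\.\d+)?|[/*+()-]|\s+~', $input, $m);
        if (implode('', $m[0]) !== $input) {
            throw new InvalidArgumentException('Unexpected character in input');
        }
        // Drop whitespace tokens.
        $this->tokens = array_values(array_filter($m[0], function ($t) {
            return trim($t) !== '';
        }));
    }

    public function evaluate()
    {
        $value = $this->parseExpression();
        if ($this->pos < count($this->tokens)) {
            throw new InvalidArgumentException('Trailing tokens');
        }
        return $value;
    }

    private function parseExpression()
    {
        $value = $this->parseTerm();
        while (in_array($this->peek(), array('+', '-'), true)) {
            $value = ($this->next() === '+')
                ? $value + $this->parseTerm()
                : $value - $this->parseTerm();
        }
        return $value;
    }

    private function parseTerm()
    {
        $value = $this->parseFactor();
        while (in_array($this->peek(), array('*', '/'), true)) {
            $value = ($this->next() === '*')
                ? $value * $this->parseFactor()
                : $value / $this->parseFactor();
        }
        return $value;
    }

    private function parseFactor()
    {
        $token = $this->next();
        if ($token === '(') {
            $value = $this->parseExpression();
            if ($this->next() !== ')') {
                throw new InvalidArgumentException('Missing closing parenthesis');
            }
            return $value;
        }
        if ($token !== null && preg_match('~^\d+(?:\.\d+)?$~', $token)) {
            return (float) $token;
        }
        throw new InvalidArgumentException('Unexpected token: ' . var_export($token, true));
    }

    private function peek() { return isset($this->tokens[$this->pos]) ? $this->tokens[$this->pos] : null; }
    private function next() { return isset($this->tokens[$this->pos]) ? $this->tokens[$this->pos++] : null; }
}

$parser = new MiniParser('2 * (3 + 4.5)');
var_dump($parser->evaluate()); // float(15)

Note that the "eval" step is just the arithmetic performed while walking down the grammar; once parsing succeeds, no PHP eval() is involved.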
You could use token_get_all(), inspect each token, and abort at the first invalid token.
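A rough sketch of that approach might look like this. The isWhitelisted() helper and the concrete whitelist (numbers, whitespace, the arithmetic operators, parentheses and the a()/b() calls from the question) are assumptions, not a finished solution; it only checks tokens, it does not check syntax:

// Run the expression through PHP's own tokenizer and reject it as soon
// as a token outside the whitelist shows up.
function isWhitelisted($expr)
{
    $allowedIds   = array(T_OPEN_TAG, T_LNUMBER, T_DNUMBER, T_WHITESPACE);
    $allowedChars = array('+', '-', '*', '/', '(', ')');
    $allowedFuncs = array('a', 'b'); // function names we permit

    foreach (token_get_all('<?php ' . $expr) as $token) {
        if (is_array($token)) {
            list($id, $text) = $token;
            if (in_array($id, $allowedIds, true)) {
                continue;
            }
            if ($id === T_STRING && in_array($text, $allowedFuncs, true)) {
                continue;
            }
            return false; // first unexpected token aborts
        } elseif (!in_array($token, $allowedChars, true)) {
            return false;
        }
    }
    return true;
}

var_dump(isWhitelisted('1 + a(2) * (3 - b(4))')); // bool(true)
var_dump(isWhitelisted('system("rm -rf /")'));    // bool(false)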
As far as validation is concerned, the following character tokens are valid:
operator: [/*+-]
funcs: (a\(|b\()
brackets: [()]
numbers: \d+(\.\d+)?
space: [ ]
A simple validation could then check whether the input string matches any combination of these patterns. Because the funcs
token is quite precise and does not clash much with the other tokens, this validation should be fairly stable without implementing any syntax/grammar yet:
$tokens = array(
    'operator' => '[/*+-]',
    'funcs'    => '(a\(|b\()',
    'brackets' => '[()]',
    'numbers'  => '\d+(\.\d+)?',
    'space'    => '[ ]',
);

// Combine all token patterns into one alternation, anchored over the whole string.
$pattern = '';
foreach ($tokens as $token) {
    $pattern .= sprintf('|(?:%s)', $token);
}
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));

echo $pattern;
The input validates only if the whole string matches against the token-based pattern. It might still be syntactically wrong PHP, but you can ensure it is built only from the specified tokens:
~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~
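For example (the sample inputs here are made up), validating an input string against the generated pattern could look like this:

var_dump((bool) preg_match($pattern, '1 + a(2.5) * (3 - b(4))')); // bool(true)
var_dump((bool) preg_match($pattern, 'foo() + *bar)'));           // bool(false)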
If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
Additionally, this can be the first step toward your own tokenizer/lexer. The token stream can then be passed on to a parser which can syntactically validate and interpret it. That's the part user187291 wrote about.
As an alternative to writing a full lexer+parser, if you only need to validate the syntax, you can formulate your grammar based on tokens as well and then run a regex-based grammar check against the token representation of the input.
The tokens are the words you use in your grammar. You will need to describe parentheses and function calls more precisely than in the tokens above, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It also uses regexes for the grammar formulation and syntax validation, but it still does not parse. In your case eval
would be the parser you're making use of. A rough sketch of the idea follows below.
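This is a hypothetical illustration only; the letter mapping, the tokenString() helper and the grammar regex are made up for this sketch. The input is first reduced to a compact token string (one symbol per token), which is then matched against a small recursive PCRE grammar:

function tokenString($input)
{
    // Map each token to a single symbol; keep parentheses, drop whitespace.
    $map = array(
        '~^(a\(|b\()~'   => 'F',  // function call opener
        '~^\d+(\.\d+)?~' => 'N',  // number
        '~^[/*+-]~'      => 'O',  // operator
        '~^[()]~'        => null, // keep parentheses literally
        '~^ +~'          => '',   // drop whitespace
    );

    $out = '';
    while ($input !== '') {
        foreach ($map as $pattern => $symbol) {
            if (preg_match($pattern, $input, $m)) {
                $out  .= ($symbol === null) ? $m[0] : $symbol;
                $input = (string) substr($input, strlen($m[0]));
                continue 2;
            }
        }
        return null; // unknown character: not tokenizable
    }
    return $out;
}

// Grammar over the token symbols:
//   expr    := operand (operator operand)*
//   operand := number | func expr ')' | '(' expr ')'
$grammar = '~
    (?(DEFINE)
        (?<operand> N | F (?&expr) \) | \( (?&expr) \) )
        (?<expr>    (?&operand) (?: O (?&operand) )* )
    )
    ^ (?&expr) $
~x';

$tokens = tokenString('1 + a(2) * (3 - b(4))');            // "NOFN)O(NOFN))"
var_dump((bool) preg_match($grammar, $tokens));            // bool(true)
var_dump((bool) preg_match($grammar, tokenString('1 + * 2'))); // bool(false)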
hakre's answer, using regexes, is a nice solution, but it is a wee bit complicated, and handling a whitelist of functions becomes rather messy. And if this does go wrong, it could have a very nasty effect on your system.
Is there a reason you don't use JavaScript's eval() instead?
Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/
It's been years since I last played with LIME, but it was already mature & stable then.
Notes:
1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.
2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.