I'm writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar that consumes Boost.Spirit.Lex tokens (Boost version 1.56).
The tokens are defined a
The correct way to construct the std::string attribute value is the following:
variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]
exactly as suggested by jv_ in his (or her) comment.
boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
The above solution, however, works well only in simple cases. It also rules out supplying the pattern from outside (from configuration data in particular), since changing the pattern to, for example, %(\\w+)% would require changing the value construction code.
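To see why the hard-coded offset is fragile, here is a minimal standard-library sketch with no Spirit involved. The helper value_from_offsets is hypothetical; it only mimics what constructing the string from (_start + 1, _end) does with the matched text, and the sample token texts are assumptions.

```cpp
#include <string>

// Hypothetical helper: mimics building the value from (_start + 1, _end)
// by dropping exactly one leading character from the matched text.
std::string value_from_offsets(const std::string& matched)
{
    if (matched.empty())
        return std::string();
    return std::string(matched.begin() + 1, matched.end());
}
```

For a matched text such as %name this yields name. Once the pattern gains a trailing delimiter and the match becomes %name%, the same code yields name% and has to be edited by hand.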
That is why it would be much better to be able to refer to capture groups from the regular expression defining the token.
Note that this still isn't perfect, since odd cases like %(\\w+)%(\\w+)%
would still require a change in the code to be handled correctly. That could be worked around by configuring not only the regex for the token but also the means of forming the value from the matched range. Yet this goes beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.
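As a rough illustration of configuring both the regex and the way the value is formed, the sketch below uses std::regex from the standard library as a stand-in for the regex engine. The token_config structure, the make_value function and the "$1.$2" format are assumptions made up for this example; they are not part of Spirit.

```cpp
#include <regex>
#include <string>

// Hypothetical configuration entry: the token pattern plus a format string
// describing how to build the value from the capture groups.
struct token_config
{
    std::string pattern;       // e.g. "%(\\w+)%(\\w+)%"
    std::string value_format;  // e.g. "$1.$2" (ECMAScript-style placeholders)
};

// Returns the formatted value, or an empty string if the text does not match.
std::string make_value(const token_config& config, const std::string& matched)
{
    const std::regex re(config.pattern);
    std::smatch results;
    if (!std::regex_match(matched, results, re))
        return std::string();
    return results.format(config.value_format);
}
```

With this shape, both the pattern and the value layout could come from configuration data, so a pattern with two captures would not need a code change.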
sehe stated in a comment elsewhere that there is no way to use capture groups from a token's regular expression, not to mention that tokens actually support only a subset of regular expressions. (Notable differences include, for example, the lack of support for naming capture groups or ignoring them!)
My own experiments in this area confirm that. Sadly, there is no way to use capture groups. There is a workaround, however: you just have to re-apply the regex in your action.
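The core of the workaround can be tried in isolation. The following is only a sketch under assumptions: it uses std::regex instead of Boost.Regex, and capture_from_match is a hypothetical free function rather than a Spirit.Lex action.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Re-applies `pattern` to the already matched token text [first, last)
// and returns the requested capture group, or "" when there is no match
// or the group index is out of range.
template <typename Iterator>
std::string capture_from_match(Iterator first, Iterator last,
                               const std::regex& pattern,
                               std::size_t capture_index = 1)
{
    std::match_results<Iterator> results;
    if (!std::regex_match(first, last, results, pattern))
        return std::string();
    if (capture_index >= results.size())
        return std::string();
    return results[capture_index].str();
}
```

The lexer has already found the token boundaries, so matching the whole range again with the same pattern is enough to recover the groups.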
To make it a little more modular, let's start with the simplest task: an action that returns the part of the token's match corresponding to the specified capture as a boost::iterator_range<Iterator>.
// Attribute and Char mirror the first two template parameters of lex::token_def.
template<typename Attribute, typename Char>
class basic_get_capture
{
public:
    typedef lex::token_def<Attribute, Char> token_type;
    typedef boost::basic_regex<Char> regex_type;

    explicit basic_get_capture(token_type const& token, int capture_index = 1)
        : token(token),
          regex(),
          capture_index(capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
    {
        typedef boost::match_results<Iterator> match_results_type;

        // Re-apply the token's regex to the matched range to recover the capture.
        match_results_type results;
        regex_match(first, last, results, get_regex());
        typename match_results_type::const_reference capture = results[capture_index];
        return boost::iterator_range<Iterator>(capture.first, capture.second);
    }

private:
    // Builds the regex lazily on first use, then reuses it.
    regex_type& get_regex()
    {
        if(regex.empty())
        {
            typename token_type::string_type const& regex_text = token.definition();
            regex.assign(regex_text);
        }
        return regex;
    }

    token_type const& token;
    regex_type regex;
    int capture_index;
};

template<typename Attribute, typename Char>
basic_get_capture<Attribute, Char> get_capture(lex::token_def<Attribute, Char> const& token, int capture_index = 1)
{
    return basic_get_capture<Attribute, Char>(token, capture_index);
}
The action uses Boost.Regex (include <boost/regex.hpp>).
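For experimenting with the same idea outside of Spirit, here is a standard-library analogue. It is a sketch under assumptions: capture_finder is a hypothetical class, std::regex replaces Boost.Regex, and the call operator takes only the iterator pair instead of the full Lex action signature. It keeps the two properties of the action: the regex is compiled lazily on first use, and the result is a pair of iterators into the original text, so no string is allocated.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical standard-library analogue of the capture action.
class capture_finder
{
public:
    explicit capture_finder(std::string pattern_text, std::size_t capture_index = 1)
        : pattern_text_(std::move(pattern_text)),
          capture_index_(capture_index),
          compiled_(false)
    {
    }

    // Returns the [begin, end) iterators of the capture inside [first, last);
    // the empty range (last, last) signals "no match".
    template <typename Iterator>
    std::pair<Iterator, Iterator> operator()(Iterator first, Iterator last)
    {
        std::match_results<Iterator> results;
        if (!std::regex_match(first, last, results, get_regex())
            || capture_index_ >= results.size())
            return std::make_pair(last, last);
        return std::make_pair(results[capture_index_].first,
                              results[capture_index_].second);
    }

    bool compiled() const { return compiled_; }

private:
    // Compiles the pattern on first use only, then reuses it.
    const std::regex& get_regex()
    {
        if (!compiled_)
        {
            regex_.assign(pattern_text_);
            compiled_ = true;
        }
        return regex_;
    }

    std::string pattern_text_;
    std::size_t capture_index_;
    bool compiled_;
    std::regex regex_;
};
```

The returned iterators point into the caller's buffer, so they stay valid only as long as that buffer does.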
While the capture range is nice to have, since it doesn't allocate any new memory for the string, it is the string that we want in the end. So here is another action built upon the previous one.
template<typename Attribute, typename Char>
class basic_get_capture_as_string
{
public:
    typedef basic_get_capture<Attribute, Char> basic_get_capture_type;
    typedef typename basic_get_capture_type::token_type token_type;

    explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
        : get_capture_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        // Delegate to the range-returning action, then copy the range into a string.
        boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
        return std::basic_string<Char>(capture.begin(), capture.end());
    }

private:
    basic_get_capture_type get_capture_functor;
};

template<typename Attribute, typename Char>
basic_get_capture_as_string<Attribute, Char> get_capture_as_string(lex::token_def<Attribute, Char> const& token, int capture_index = 1)
{
    return basic_get_capture_as_string<Attribute, Char>(token, capture_index);
}
No magic here. We just make a std::basic_string<Char> from the range returned by the simpler action.
Actions that return a value are of little use to us. The ultimate goal is to set the token value from the capture, and this is done by the last action.
template<typename Attribute, typename Char>
class basic_set_val_from_capture
{
public:
    typedef basic_get_capture_as_string<Attribute, Char> basic_get_capture_as_string_type;
    typedef typename basic_get_capture_as_string_type::token_type token_type;

    explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
        : get_capture_as_string_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        // Store the captured text as the token's value.
        std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
        context.set_value(capture);
    }

private:
    basic_get_capture_as_string_type get_capture_as_string_functor;
};

template<typename Attribute, typename Char>
basic_set_val_from_capture<Attribute, Char> set_val_from_capture(lex::token_def<Attribute, Char> const& token, int capture_index = 1)
{
    return basic_set_val_from_capture<Attribute, Char>(token, capture_index);
}
The actions are used like this:
variable[set_val_from_capture(variable)]
Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems suitable in most cases.
Creating Functions
set_val_from_capture (or get_capture_as_string or get_capture, respectively) is an auxiliary function used for automatic deduction of template arguments from the token_def. In particular, what we need is the Char type to build the corresponding regular expression.
I'm not sure if this could be reasonably avoided, and even if it could, it would significantly complicate the call operator (especially if we strive to cache the regex object instead of building it anew each time). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assumed that they don't have to be the same.
Repeating the Token
A definitely unpleasant part of the action is the need to pass the token itself as an argument, which is a repetition.
The token is, however, needed for the Char type as described above and to... get the regular expression!
It seems to me that, at least in theory, we could obtain the token somehow "at run-time" based on the id argument to the action (which we currently just ignore). However, I failed to find any way to obtain the token_def from the token's identifier, whether through the context argument or the lexer itself (which could be passed to the action as this through the creating function).
Reusability
Since those are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to some numeric value, you would have to write yet another action in this style instead of composing a complex action at the token.
At first I tried to achieve something like this:
variable[lex::_val = get_capture_as_string(variable)]
It seems more flexible, as you could easily add more code around it, for example wrapping it in some conversion function.
But I failed to achieve it, although I feel I didn't try hard enough. Learning more about Boost.Phoenix would surely help a lot here.
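The kind of composition meant here (capture extraction wrapped in a conversion) can at least be sketched outside of Phoenix. The following is an illustration under assumptions only: capture_as is a hypothetical helper, it uses std::regex, and it is a plain function template rather than a lazy Phoenix expression usable inside a token's action.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: extracts a capture group from [first, last) and
// passes it through `convert`. Returns `fallback` when nothing matches
// or the group index is out of range.
template <typename Result, typename Iterator, typename Convert>
Result capture_as(Iterator first, Iterator last, const std::regex& pattern,
                  Convert convert, Result fallback,
                  std::size_t capture_index = 1)
{
    std::match_results<Iterator> results;
    if (!std::regex_match(first, last, results, pattern)
        || capture_index >= results.size())
        return fallback;
    return convert(results[capture_index].str());
}
```

Passing the conversion as a callable keeps the capture logic in one place, which is what a working lex::_val = get_capture_as_string(variable) expression would have offered inside the lexer.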
Double Work
This whole workaround doesn't save us from doing double work: the regex is both parsed and matched a second time. But as mentioned at the beginning, there seems to be no better way (without altering Boost.Spirit itself).