How to make a Boost.Spirit.Lex token value be a substring of the matched sequence (preferably via a regex capture group)

面向向阳花 2021-01-15 15:30

I'm writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar based on Boost.Spirit.Lex tokens (Boost version 1.56).

The tokens are defined a

1 Answer
  • 2021-01-15 15:51

    Simple Solution

    The correct way to construct the std::string attribute value is the following:

    variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]
    

    exactly as suggested by jv_ in his (or her) comment.

    boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
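
    To make the iterator arithmetic concrete, here is a minimal sketch outside of Spirit. It assumes the token pattern is %(\\w+), so that every match starts with a single % character. The helper name strip_leading_char is made up for this illustration and is not part of Boost.

```cpp
#include <string>

// Hypothetical helper: does for a std::string what
// construct<std::string>(_start + 1, _end) does for the matched range.
std::string strip_leading_char(std::string const& match)
{
    // Skip the first character (the '%') and copy the rest.
    // Assumes the match is non-empty, which holds for a pattern like %(\w+).
    return std::string(match.begin() + 1, match.end());
}
```

    For a match of %foo this yields foo. For a match of %foo% the same offsets leave the trailing % in the value, which is the limitation discussed next.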

    Regular Expression Solution

    The above solution, however, works well only in simple cases. It also rules out supplying the pattern from outside (from configuration data in particular), since changing the pattern to, for example, %(\\w+)% would require changing the value construction code.

    That is why it would be much better to be able to refer to capture groups from the regular expression defining the token.

    Note that this still isn't perfect, since odd cases like %(\\w+)%(\\w+)% would still require a change in the code to be handled correctly. That could be worked around by configuring not only the regex for the token but also the means of forming the value from the matched range. Yet this goes beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.
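
    As a rough illustration of "configuring the means to form the value", one option is a format string that refers to the captures. The sketch below is not Spirit.Lex code: it uses std::regex from the standard library as a stand-in, and the function name format_match and the format string are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds a value from a matched text using a
// configurable pattern and a configurable format string ($1, $2, ...).
std::string format_match(std::string const& matched,
                         std::string const& pattern,
                         std::string const& format)
{
    std::regex const re(pattern);
    std::smatch results;
    // The whole text has to match, just like a token match would.
    if (!std::regex_match(matched, results, re))
        return std::string();
    // Substitute the captures into the format string.
    return results.format(format);
}
```

    With the pattern %(\\w+)%(\\w+)% and the format $1.$2, the text %foo%bar% would become foo.bar, and neither the pattern nor the format needs to be known at compile time.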

    sehe stated in a comment elsewhere that there is no way to use capture groups from a token's regular expression. Moreover, tokens actually support only a subset of regular expression syntax. (Notable differences include, for example, the lack of support for naming capture groups or ignoring them!)

    My own experiments in this area confirm that. Sadly, there is no way to use capture groups. There is a workaround, however: you just have to re-apply the regex in your action.
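
    Stripped of all Spirit machinery, the idea of re-applying the regex can be sketched as follows. This uses std::regex as a stand-in for Boost.Regex, and capture_of is a name invented for the example. The text passed in plays the role of the range the lexer has already matched.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: match the already-tokenized text against the same
// pattern again and return the requested capture group.
std::string capture_of(std::string const& matched,
                       std::string const& pattern,
                       std::size_t index = 1)
{
    std::regex const re(pattern);
    std::smatch results;
    // Return an empty string if the pattern does not match the whole text
    // or has no capture group with that index.
    if (!std::regex_match(matched, results, re) || index >= results.size())
        return std::string();
    return results[index].str();
}
```

    Because the value comes from the capture group, switching the pattern from %(\\w+) to %(\\w+)% needs no change in the code that forms the value.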

    Action Obtaining Capture Range

    To keep things modular, let's start with the simplest task: an action that returns, as a boost::iterator_range, the part of the token's match that corresponds to the specified capture.

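    #include <string>
    #include <boost/range/iterator_range.hpp>
    #include <boost/regex.hpp>
    #include <boost/spirit/include/lex_lexertl.hpp>
    
    // Namespace alias assumed by all the code below.
    namespace lex = boost::spirit::lex;
    
    template<typename Attribute, typename Char, typename Idtype>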
    class basic_get_capture
    {
    public:
        typedef lex::token_def<Attribute, Char, Idtype> token_type;
        typedef boost::basic_regex<Char> regex_type;
    
        explicit basic_get_capture(token_type const& token, int capture_index = 1)
            : token(token),
              regex(),
              capture_index(capture_index)
        {
        }
    
        template<typename Iterator, typename IdType, typename Context>
        boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
        {
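            typedef boost::match_results<Iterator> match_results_type;
    
            // Re-apply the token's pattern (as a Boost.Regex) to the already matched text.
            match_results_type results;
            regex_match(first, last, results, get_regex());
            // Pick the requested capture group and expose it as an iterator range.
            typename match_results_type::const_reference capture = results[capture_index];
            return boost::iterator_range<Iterator>(capture.first, capture.second);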
        }
    
    private:
        regex_type& get_regex()
        {
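            if(regex.empty())
            {
                // Compile the regex lazily on first use.
                // "typename" is required here because string_type is a dependent type.
                typename token_type::string_type const& regex_text = token.definition();
                regex.assign(regex_text);
            }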
            return regex;
        }
    
        token_type const& token;
        regex_type regex;
        int capture_index;
    };
    
    template<typename Attribute, typename Char, typename Idtype>
    basic_get_capture<Attribute, Char, Idtype> get_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
    {
        return basic_get_capture<Attribute, Char, Idtype>(token, capture_index);
    }
    

    The action uses Boost.Regex (include <boost/regex.hpp>).

    Action Obtaining Capture as String

    The capture range is nice to have because it doesn't allocate any new memory for the string, but it is the string that we want in the end. So here is another action, built upon the previous one.

    template<typename Attribute, typename Char, typename Idtype>
    class basic_get_capture_as_string
    {
    public:
        typedef basic_get_capture<Attribute, Char, Idtype> basic_get_capture_type;
        typedef typename basic_get_capture_type::token_type token_type;
    
        explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
            : get_capture_functor(token, capture_index)
        {
        }
    
        template<typename Iterator, typename IdType, typename Context>
        std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
        {
            boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
            return std::basic_string<Char>(capture.begin(), capture.end());
        }
    
    private:
        basic_get_capture_type get_capture_functor;
    };
    
    template<typename Attribute, typename Char, typename Idtype>
    basic_get_capture_as_string<Attribute, Char, Idtype> get_capture_as_string(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
    {
        return basic_get_capture_as_string<Attribute, Char, Idtype>(token, capture_index);
    }
    

    No magic here. We just make an std::basic_string from the range returned by the simpler action.

    Action Assigning Value From the Capture

    Actions that return a value are of little use to us. The ultimate goal is to set the token value from the capture, and this is done by the last action.

    template<typename Attribute, typename Char, typename Idtype>
    class basic_set_val_from_capture
    {
    public:
        typedef basic_get_capture_as_string<Attribute, Char, Idtype> basic_get_capture_as_string_type;
        typedef typename basic_get_capture_as_string_type::token_type token_type;
    
        explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
            : get_capture_as_string_functor(token, capture_index)
        {
        }
    
        template<typename Iterator, typename IdType, typename Context>
        void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
        {
            std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
            context.set_value(capture);
        }
    
    private:
        basic_get_capture_as_string_type get_capture_as_string_functor;
    };
    
    template<typename Attribute, typename Char, typename Idtype>
    basic_set_val_from_capture<Attribute, Char, Idtype> set_val_from_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
    {
        return basic_set_val_from_capture<Attribute, Char, Idtype>(token, capture_index);
    }
    

    Discussion

    The actions are used like this:

    variable[set_val_from_capture(variable)]
    

    Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems suitable in most cases.

    Creating Functions

    set_val_from_capture (or get_capture_as_string or get_capture respectively) is an auxiliary function used for automatic deduction of template arguments from the token_def. In particular what we need is the Char type to make corresponding regular expression.

    I'm not sure whether this could reasonably be avoided, and even if it could, it would significantly complicate the call operator (especially if we strove to cache the regex object instead of building it anew each time). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assumed that they don't have to be the same.
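
    For comparison, here is a sketch of the alternative under the opposite assumption, namely that the pattern and the tokenized sequence share one character type. In that case the character type can be deduced from the iterator and no creating function is needed. It uses std::regex rather than Boost.Regex, the name capture_from_range is made up, and the regex is rebuilt on every call (so it illustrates the deduction, not the caching).

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: the character type comes from the iterator,
// so callers never have to spell it out.
template <typename Iterator>
std::basic_string<typename std::iterator_traits<Iterator>::value_type>
capture_from_range(
    Iterator first, Iterator last,
    std::basic_string<typename std::iterator_traits<Iterator>::value_type> const& pattern,
    std::size_t index = 1)
{
    typedef typename std::iterator_traits<Iterator>::value_type char_type;

    std::basic_regex<char_type> const re(pattern);
    std::match_results<Iterator> results;
    // Empty result if the range does not match or the capture does not exist.
    if (!std::regex_match(first, last, results, re) || index >= results.size())
        return std::basic_string<char_type>();
    return results[index].str();
}
```

    The same template then serves both std::string and std::wstring input, as long as the pattern is given in the matching character type.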

    Repeating the Token

    A definitely unpleasant part of the action is the need to pass the token itself as an argument, which means repeating it.

    The token is however needed for the Char type as described above and to... get the regular expression!

    It seems to me that, at least in theory, we should be able to obtain the token "at run-time" from the id argument of the action (which we currently just ignore). However, I failed to find any way to obtain a token_def from a token's identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creating function).
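
    One conceivable way around this, sketched purely under assumptions, is a small hand-maintained registry that maps token ids to their patterns, so that an action could look the pattern up by id. The class pattern_registry and its members are hypothetical, it uses std::regex rather than Boost.Regex, and wiring it into a real lexer (filling it where the tokens are defined and handing it to the action) is left out.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical registry: token id -> compiled pattern.
class pattern_registry
{
public:
    // Remember the pattern a token id was defined with (compiled once).
    void add(std::size_t id, std::string const& pattern)
    {
        patterns_[id] = std::regex(pattern);
    }

    // Return the requested capture, or an empty string when the id is
    // unknown, the text does not match, or the capture does not exist.
    std::string capture(std::size_t id, std::string const& matched,
                        std::size_t index = 1) const
    {
        std::map<std::size_t, std::regex>::const_iterator it = patterns_.find(id);
        if (it == patterns_.end())
            return std::string();
        std::smatch results;
        if (!std::regex_match(matched, results, it->second) || index >= results.size())
            return std::string();
        return results[index].str();
    }

private:
    std::map<std::size_t, std::regex> patterns_;
};
```

    This only moves the repetition to the place where the registry is filled, so it is a trade-off rather than a real fix.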

    Reusability

    Since these are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to a numeric value, you would have to write yet another action in the same way, instead of composing a more complex action at the token.
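
    To show what such an additional conversion step would have to do, here is a sketch outside of Spirit. It takes the conversion as a parameter instead of hard-coding it. It uses std::regex as a stand-in for Boost.Regex, and the names convert_capture and to_int are invented for the example.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: extract a capture and pass it through a
// caller-supplied conversion function.
template <typename Convert>
auto convert_capture(std::string const& matched, std::regex const& pattern,
                     std::size_t index, Convert convert)
    -> decltype(convert(std::string()))
{
    std::smatch results;
    if (!std::regex_match(matched, results, pattern) || index >= results.size())
        throw std::runtime_error("pattern does not match or has no such capture");
    return convert(results[index].str());
}

// Example conversion: parse the capture as a decimal integer.
inline int to_int(std::string const& text)
{
    return std::stoi(text);
}
```

    A call such as convert_capture(text, std::regex("#(\\d+)"), 1, to_int) would then produce an int. A real Spirit.Lex action would still need the call operator signature shown earlier and would store the result with context.set_value.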

    At first I tried to achieve something like this:

    variable[lex::_val = get_capture_as_string(variable)]
    

    This seems more flexible, as you could easily add more code around it, for example by wrapping it in some conversion function.

    But I failed to achieve it. Although I feel like I didn't try hard enough. Learning more about Boost.Phoenix would surely help here a lot.

    Double Work

    This whole workaround doesn't save us from doing double work: the regex is parsed twice and then matched twice. But as mentioned at the beginning, it seems that there is no better way (without altering Boost.Spirit itself).
