How to split a sentence with an escaped whitespace?

Submitted by 十年热恋 on 2019-12-14 02:49:54

Question


I want to split my sentence using whitespace as my delimiter except for escaped whitespaces. Using boost::split and regex, how can I split it? If not possible, how else?

Example:

std::string sentence = "My dog Fluffy\\ Cake likes to jump";

Result:
My
dog
Fluffy\ Cake
likes
to
jump


Answer 1:


Three implementations:

  1. With Boost Spirit
  2. With Boost Regex
  3. Handwritten parser

With Boost Spirit

Here's how I'd do this with Boost Spirit. This might seem overkill, but experience teaches me that once you're splitting input text you will likely require more parsing logic.

Boost Spirit shines when you scale from "just splitting tokens" to a real grammar with production rules.

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";
    using It = std::string::const_iterator;
    It f = sentence.begin(), l = sentence.end();

    std::vector<std::string> words;

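    // A word is a run of escaped characters ('\\' followed by any character)
    // or printable non-space characters. The literal '\\' exposes no
    // attribute, so the backslash itself is not stored in the word.
    bool ok = qi::phrase_parse(f, l,
            *qi::lexeme [ +('\\' >> qi::char_ | qi::graph) ], // words
            qi::space - "\\ ", // skipper
            words);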

    if (ok) {
        std::cout << "Parsed:\n";
        for (auto& w : words)
            std::cout << "\t'" << w << "'\n";
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}

With Boost Regex

This looks really succinct, but it

  • requires linking to boost_regex (Boost releases before 1.76; later releases are header-only under C++11 and newer)
  • uses a "black magic" negative lookbehind assertion: http://www.regular-expressions.info/lookaround.html

#include <iostream>
#include <string>
#include <vector>
#include <boost/regex.hpp>
#include <boost/algorithm/string_regex.hpp>

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;
    boost::algorithm::split_regex(words, sentence, boost::regex("(?<!\\\\)\\s"), boost::match_default);

    for (auto& w : words)
        std::cout << " '" << w << "'\n";
}

Using C++11 raw string literals, you could write the regular expression slightly less obscurely: boost::regex(R"((?<!\\)\s)"), meaning "any whitespace character not preceded by a backslash".
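
For readers without Boost, here is a minimal sketch of a related approach. std::regex with its default ECMAScript grammar does not support lookbehind, so instead of splitting on unescaped whitespace, this version matches the tokens themselves. The helper name split_escaped and the pattern are assumptions for illustration and are not part of the original answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): a token is one or more
// of either "backslash + any character" or a single non-whitespace character.
std::vector<std::string> split_escaped(std::string const& sentence) {
    static std::regex const token(R"((?:\\.|\S)+)");

    std::vector<std::string> words;
    // Walk over every non-overlapping match; a default-constructed
    // sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(sentence.begin(), sentence.end(), token), end;
         it != end; ++it)
        words.push_back(it->str());
    return words;
}
```

Calling split_escaped on the example sentence should produce six words, with the escaped token kept verbatim, backslash included, as in the question's expected result. Unlike the split-based version, this one never produces empty strings for runs of consecutive whitespace.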

Handwritten parser

This is somewhat more tedious but, like the Spirit grammar, it is completely generic and allows good performance.

However, it doesn't scale nearly as gracefully as the Spirit approach once you start adding complexity to your grammar. An advantage is that it compiles much faster than the Spirit version.

#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename It, typename Out>
Out tokens(It f, It l, Out out) {
    std::string accum;
    auto flush = [&] { 
        if (!accum.empty()) {
            *out++ = accum;
            accum.resize(0);
        }
    };

    while (f!=l) {
        switch(*f) {
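            case '\\':
                if (++f!=l && *f==' ') {
                    accum += ' ';  // escaped space stays inside the current word
                    ++f;           // consume it so it is not treated as a delimiter
                } else {
                    accum += '\\'; // lone backslash is kept literally
                }
                break;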
            case ' ': case '\t': case '\r': case '\n':
                ++f;
                flush();
                break;
            default:
                accum += *f++;
        }
    }
    flush();
    return out;
}

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;

    tokens(sentence.begin(), sentence.end(), std::back_inserter(words));

    for (auto& w : words)
        std::cout << "\t'" << w << "'\n";
}
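
The question's expected result keeps the backslash in the escaped token (Fluffy\ Cake), whereas the parser above is written to unescape it. If the escape should be preserved, a variant along the following lines might do. This is only a sketch: the name split_keep_escapes and the choice to treat a backslash before any character (not only a space) as an escape are assumptions, not part of the original answer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant (not from the original answer): splits on unescaped
// whitespace and copies "\<char>" through unchanged, backslash included.
std::vector<std::string> split_keep_escapes(std::string const& input) {
    std::vector<std::string> words;
    std::string accum;

    for (std::size_t i = 0; i < input.size(); ++i) {
        char const c = input[i];
        if (c == '\\' && i + 1 < input.size()) {
            accum += c;           // keep the backslash
            accum += input[++i];  // and the escaped character, whatever it is
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            if (!accum.empty()) { // unescaped whitespace ends the current word
                words.push_back(accum);
                accum.clear();
            }
        } else {
            accum += c;           // ordinary character (or a trailing backslash)
        }
    }
    if (!accum.empty())
        words.push_back(accum);
    return words;
}
```

Returning a vector instead of writing to an output iterator keeps the sketch short; the iterator-based interface of tokens() above is the more generic design.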


Source: https://stackoverflow.com/questions/29380897/how-to-split-a-sentence-with-an-escaped-whitespace
