问题
Is it a good idea? For a reason I thought it should be faster than boost's tokenizer or split. however most of the time I'm stuck in the boost::spirit::compile
template <typename Iterator>
struct ValueList : bsq::grammar<Iterator, std::vector<std::string>()>
{
ValueList(const std::string& sep, bool isCaseSensitive) : ValueList::base_type(query)
{
if(isCaseSensitive)
{
query = value >> *(sep >> value);
value = *(bsq::char_ - sep);
}
else
{
auto separator = bsq::no_case[sep];
query = value >> *(separator >> value);
value = *(bsq::char_ - separator);
}
}
bsq::rule<Iterator, std::vector<std::string>()> query;
bsq::rule<Iterator, std::string()> value;
};
inline bool Split(std::vector<std::string>& result, const std::string& buffer, const std::string& separator,
bool isCaseSensitive)
{
result.clear();
ValueList<std::string::const_iterator> parser(separator, isCaseSensitive);
auto itBeg = buffer.begin();
auto itEnd = buffer.end();
if(!(bsq::parse(itBeg, itEnd, parser, result) && (itBeg == itEnd)))
result.push_back(buffer);
return true;
}
I've implemented it as shown above. What is wrong with my code? or just because the separator is defined in runtime the recompilation is inevitable?
EDIT001:
Example and comparison with possible implementation with boost::split and original imp with tokenizer on CoLiRu
Looks like coliru is down now. In any case these are result for 1M runs on string "2lkhj309|ioperwkl|20sdf39i|rjjdsf|klsdjf230o|kx23904iep2|xp39f4p2|xlmq2i3219" with separator "|"
8000000 splits in 1081ms.
8000000 splits in 1169ms.
8000000 splits in 2663ms.
first is for tokenizer, second is for boost::split and the third is for boost::spirit
回答1:
First off, the different versions do not do the same thing:
- in particular token compression behaves differently (I fixed it for
boost::split
but it doesn't appear to be a feature forboost::tokenizer
) - separators were treated as string literals rather than charsets in the Spirit version (fixed that)
Yes, recompiles are inevitable with dynamic separators. But no, this is not the bottleneck (the other approaches have dynamic separators too):
I've done some optimizations. The timings:
Coliru clang++:
8000000 original (boost::tokenizer) rate: 2.84257μs 10000000 possible (boost::split) rate: 3.09941μs 10000000 spirit (dynamic version) rate: 1.45456μs 10000000 spirit (direct version, avoid type erasure) rate: 1.25588μs next step: 10000000 spirit (precompiled sep) rate: 1.18059μs
Coliru g++
8000000 original (boost::tokenizer) rate: 2.92805μs 10000000 possible (boost::split) rate: 2.75442μs 10000000 spirit (dynamic version) rate: 1.32821μs 10000000 spirit (direct version, avoid type erasure) rate: 1.10712μs next step: 10000000 spirit (precompiled sep) rate: 1.0791μs
Local system g++:
sehe@desktop:/tmp$ time ./test 8000000 original (boost::tokenizer) rate: 1.80061μs 10000000 possible (boost::split) rate: 1.29754μs 10000000 spirit (dynamic version) rate: 0.607789μs 10000000 spirit (direct version, avoid type erasure) rate: 0.488087μs next step: 10000000 spirit (precompiled sep) rate: 0.498769μs
As you can see the Spirit approach doesn't need to be slower. What steps did I take? http://paste.ubuntu.com/11001344/
- refactor benchmark to display rate (no change) 3.523μs
- made case_insensitivity choice of caller (just use
no_case[char_(delimiter)]
if required) 2.742μs. - eliminate
value
subrule (reduced copying and dynamic dispatch because of type-erased non-terminal rule) 2.579μs. Made delimiter charset instead of string literal: 2.693μs.
see intermediate version on coliru. (I recommend the code below, it's much more cleaned up)
Using qi::raw[] instead of std::string synthesized attributes (avoid copying!) 0.624072μs
- Eliminating all non-terminals (i.e. type-erasure; see
spirit_direct
implementation) rate: 0.491011μs
Now it seems fairly obvious that all the implementations would benefit from not "compiling" the separator each time. I didn't do it for all the approaches, but for fun let's do it for the Spirit version:
- Using hardcoded '|' delimiter 0.455269μs // fixed delimiter
Full listing:
Live On Coliru
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>
#include <vector>
#include <string>
#include <chrono>
#include <iostream>
void original(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
result.clear();
boost::char_separator<char> sep(delimiter.c_str());
boost::tokenizer<boost::char_separator<char>, std::string::const_iterator, std::string> tok(input, sep);
for (const auto& token : tok)
{
result.push_back(token);
}
}
void possible(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
result.clear();
boost::split(result, input, boost::is_any_of(delimiter), boost::algorithm::token_compress_off);
}
namespace bsq = boost::spirit::qi;
void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
result.clear();
using namespace bsq;
if (!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
result.push_back(input);
}
namespace detail {
template <typename Sep> bsq::rule<std::string::const_iterator, std::vector<std::string>()>
make_spirit_parser(Sep const& sep)
{
using namespace bsq;
return raw[*(char_ - sep)] % sep;
}
static const auto precompiled_pipes = make_spirit_parser('|');
}
void spirit(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
result.clear();
if (!bsq::parse(input.begin(), input.end(), detail::make_spirit_parser(bsq::char_(delimiter)), result))
result.push_back(input);
}
void spirit_pipes(std::vector<std::string>& result, const std::string& input)
{
result.clear();
if (!bsq::parse(input.begin(), input.end(), detail::precompiled_pipes, result))
result.push_back(input);
}
template <typename F> void bench(std::string const& caption, F approach) {
size_t const iterations = 1000000;
using namespace std::chrono;
using C = high_resolution_clock;
auto start = C::now();
size_t count = 0;
for (auto i = 0U; i < iterations; ++i) {
count += approach();
}
auto us = duration_cast<std::chrono::microseconds>(C::now() - start).count();
std::cout << count << " " << caption << " rate: " << (1.*us/iterations) << "μs\n";
}
int main()
{
std::string const input = "2309|ioperwkl|2039i|rjjdsf|klsdjf230o|kx23904iep2|xp,39,4p2|xlmq2i3219||";
auto separator = "|";
std::vector<std::string> result;
bench("original (boost::tokenizer)", [&] {
original(result, input, separator);
return result.size();
});
bench("possible (boost::split)", [&] {
possible(result, input, separator);
return result.size();
});
bench("spirit (dynamic version)", [&] {
spirit(result, input, separator);
return result.size();
});
bench("spirit (direct version, avoid type erasure)", [&] {
spirit_direct(result, input, separator);
return result.size();
});
std::cout << "\nnext step:\n";
bench("spirit (precompiled sep)", [&] {
spirit_pipes(result, input);
return result.size();
});
}
来源:https://stackoverflow.com/questions/30046512/splitting-string-using-boost-spirit