I am trying to parse TPC-H files with Boost Spirit Qi. My implementation is inspired by the employee example of Spirit Qi ( http://www.boost.org/doc/libs/1_52_0/libs/spirit/example/qi
I found a solution to my problem. As described in the post "Boost Spirit QI grammar slow for parsing delimited strings", the performance bottleneck is the string handling of Spirit Qi. All other data types seem to be quite fast.
I avoid this problem by handling the data myself instead of leaving it to Spirit Qi.
My solution uses a helper class that provides one function for every field of the CSV file. These functions store the values in a struct; strings are stored in fixed-size char arrays. When the parser reaches the end of a line (in the code below, the final '|' of a record), it calls a function that adds the struct to the result vector. The Boost parser calls these functions through semantic actions instead of storing the values in a vector on its own.
Here is my code for the region.tbl file of the TPC-H benchmark:
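#include <boost/bind.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
using qi::int_; using qi::lexeme; using qi::phrase_parse;
using ascii::char_; using ascii::space;
using std::vector; using std::cout; using std::endl;

struct region {
    int  r_regionkey;
    char r_name[25];
    char r_comment[152];
};

// Receives the parsed values field by field and collects finished rows.
class regionStorage {
public:
    regionStorage(vector<region>* regions) : regions(regions), currentregion(), pos(0) {}
    void storer_regionkey(int const& i) {
        currentregion.r_regionkey = i;
    }
    void storer_name(char const& i) {
        // Ignore characters that do not fit instead of writing past the array.
        if (pos < static_cast<int>(sizeof(currentregion.r_name)))
            currentregion.r_name[pos] = i;
        pos++;
    }
    void storer_comment(char const& i) {
        if (pos < static_cast<int>(sizeof(currentregion.r_comment)))
            currentregion.r_comment[pos] = i;
        pos++;
    }
    void resetPos() {
        pos = 0;
    }
    void endOfLine() {
        pos = 0;
        regions->push_back(currentregion);
        // Zero the struct so a shorter string in the next row keeps no stale characters.
        currentregion = region();
    }
private:
    vector<region>* regions;
    region currentregion;
    int pos;
};

// dataPointer and state->dataEndPointer are defined elsewhere and delimit the input buffer.
void parseRegion() {
    vector<region> regions;
    regionStorage regionstorageObject(&regions);
    phrase_parse(dataPointer,      /*< start iterator >*/
        state->dataEndPointer,     /*< end iterator >*/
        (*(lexeme[
            +(int_[boost::bind(&regionStorage::storer_regionkey, &regionstorageObject, _1)] - '|') >> '|' >>
            +(char_[boost::bind(&regionStorage::storer_name, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::resetPos, &regionstorageObject)] >>
            +(char_[boost::bind(&regionStorage::storer_comment, &regionstorageObject, _1)] - '|') >> char_('|')[boost::bind(&regionStorage::endOfLine, &regionstorageObject)]
        ])), space);
    cout << regions.size() << endl;
}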
It is not a pretty solution, but it works and it is much faster (2.2 seconds for 1 GB of TPC-H data, multithreaded).
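To check the storage idea on a small sample without Spirit, a dependency-free sketch along the following lines may help. It is not the Spirit-based parser from this answer: `Region`, `storeField`, `fieldToString` and `parseRegions` are hypothetical names introduced only for illustration, and the sketch assumes every record has the form `key|name|comment|` followed by a line break.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Same layout as the region struct: fixed-size character fields.
struct Region {
    int  r_regionkey;
    char r_name[25];
    char r_comment[152];
};

// Hypothetical helper: copy [begin, end) into a fixed buffer, truncating if
// the field is too long and zero-padding the remainder.
template <std::size_t N>
void storeField(char (&dst)[N], const char* begin, const char* end) {
    std::size_t len = static_cast<std::size_t>(end - begin);
    if (len > N) len = N;
    std::memcpy(dst, begin, len);
    std::memset(dst + len, 0, N - len);
}

// Hypothetical helper: read a zero-padded field back as a std::string.
template <std::size_t N>
std::string fieldToString(const char (&src)[N]) {
    const void* zero = std::memchr(src, 0, N);
    std::size_t len = zero ? static_cast<std::size_t>(static_cast<const char*>(zero) - src) : N;
    return std::string(src, len);
}

// Find the next '|' in [p, end), or return a null pointer if there is none.
inline const char* nextBar(const char* p, const char* end) {
    return static_cast<const char*>(std::memchr(p, '|', static_cast<std::size_t>(end - p)));
}

// Hand-written parser for "key|name|comment|" records separated by line breaks.
std::vector<Region> parseRegions(const char* p, const char* end) {
    std::vector<Region> out;
    while (p < end) {
        if (*p == '\n' || *p == '\r') { ++p; continue; }  // skip line breaks
        const char* f1 = nextBar(p, end);
        if (!f1) break;
        const char* f2 = nextBar(f1 + 1, end);
        if (!f2) break;
        const char* f3 = nextBar(f2 + 1, end);
        if (!f3) break;  // incomplete record: stop parsing

        Region r;
        // strtol stops at the first non-digit, which is at the latest the '|' at f1.
        r.r_regionkey = static_cast<int>(std::strtol(p, nullptr, 10));
        storeField(r.r_name, f1 + 1, f2);
        storeField(r.r_comment, f2 + 1, f3);
        out.push_back(r);
        p = f3 + 1;
    }
    return out;
}
```

Calling `parseRegions(data.data(), data.data() + data.size())` on a `std::string` holding a few made-up records returns one `Region` per line. Running the same sample through both parsers is a cheap way to confirm that the row counts and field contents agree.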