Question
So I have the following string of data, which is being received through a TCP Winsock connection, and I would like to do an advanced tokenization into a vector of structs, where each struct represents one record.
std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};
Each record in the string is delimited by a newline. Here is my attempt at splitting up the records, but not yet splitting up the fields:
void tokenize(const std::string& str, std::vector<std::string>& records)
{
    // Skip delimiters at beginning.
    std::string::size_type lastPos = str.find_first_not_of("\n", 0);
    // Find first "non-delimiter".
    std::string::size_type pos = str.find_first_of("\n", lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        records.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of".
        lastPos = str.find_first_not_of("\n", pos);
        // Find next "non-delimiter".
        pos = str.find_first_of("\n", lastPos);
    }
}
It seems totally unnecessary to repeat all of that code again to further tokenize each record on the colon (the internal field separator) into the struct and push each struct into a vector. I'm sure there is a better way of doing this, or perhaps the design itself is wrong.
Thank you for any help.
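(For reference, one way to avoid repeating that scanning loop is to factor it into a single split helper and call it twice: once on "\n" for records, then once on ":" for fields. A minimal sketch, assuming the corrected table_t above; the names split and parse_records are illustrative, not from the question:)

#include <cstddef>
#include <string>
#include <vector>

// Split str on any character in delims, skipping runs of delimiters.
std::vector<std::string> split(const std::string& str, const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string::size_type lastPos = str.find_first_not_of(delims, 0);
    std::string::size_type pos = str.find_first_of(delims, lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delims, pos);
        pos = str.find_first_of(delims, lastPos);
    }
    return tokens;
}

// Reuse the same helper at both levels: records on '\n', fields on ':'.
std::vector<table_t> parse_records(const std::string& buf)
{
    std::vector<table_t> result;
    std::vector<std::string> lines = split(buf, "\n");
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        std::vector<std::string> fields = split(lines[i], ":");
        if (fields.size() == 5)
        {
            table_t rec = { fields[0], fields[1], fields[2], fields[3], fields[4] };
            result.push_back(rec);
        }
    }
    return result;
}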
Answer 1:
For breaking the string up into records, I'd use istringstream, if only because that will simplify the changes later when I want to read from a file. For tokenizing, the most obvious solution is boost::regex, so:
std::vector<table_t> parse( std::istream& input )
{
    std::vector<table_t> retval;
    std::string line;
    while ( std::getline( input, line ) ) {
        static boost::regex const pattern(
            "([^:]*):([^:]*):([^:]*):([^:]*):([^:]*)" );
        boost::smatch matched;
        if ( !boost::regex_match( line, matched, pattern ) ) {
            // Error handling...
        } else {
            retval.push_back(
                table_t( matched[1], matched[2], matched[3],
                         matched[4], matched[5] ) );
        }
    }
    return retval;
}
(I've assumed the logical constructor for table_t. Also: there's a very long tradition in C that names ending in _t are typedefs, so you're probably better off finding some other convention.)
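A quick usage sketch (not from the original answer), assuming the table_t above has a constructor taking the five strings:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::string buf = "44:william:adama:commander:stuff\n"
                      "33:luara:roslin:president:data\n";
    std::istringstream input( buf );
    std::vector<table_t> records = parse( input );
    std::cout << records.size() << " records parsed\n";  // expect 2
    return 0;
}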
Answer 2:
My solution:
#include <cstring>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

struct colon_separated_only: std::ctype<char>
{
    colon_separated_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        typedef std::ctype<char> cctype;
        static const cctype::mask *const_rc = cctype::classic_table();

        static cctype::mask rc[cctype::table_size];
        std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));

        rc[':'] = std::ctype_base::space;
        return &rc[0];
    }
};
struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};
int main() {
    std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";

    std::stringstream s(buf);
    s.imbue(std::locale(std::locale(), new colon_separated_only()));

    table_t t;
    std::vector<table_t> data;
    while ( s >> t.key >> t.first >> t.last >> t.rank >> t.additional )
    {
        data.push_back(t);
    }

    for (size_t i = 0; i < data.size(); ++i)
    {
        std::cout << data[i].key << " ";
        std::cout << data[i].first << " " << data[i].last << " ";
        std::cout << data[i].rank << " " << data[i].additional << std::endl;
    }
    return 0;
}
Output:
44 william adama commander stuff
33 luara roslin president data
Online Demo : http://ideone.com/JwZuk
The technique I used here is described in another answer of mine to a different question:
Elegant ways to count the frequency of words in a file
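As a possible refinement (a sketch, not part of the original answer), the field extraction could be wrapped in an operator>> for table_t, so the read loop in main() shrinks to while ( s >> t ):

// Sketch: stream extractor for table_t; it relies on the imbued locale
// above that classifies ':' as whitespace.
std::istream& operator>>(std::istream& is, table_t& t)
{
    return is >> t.key >> t.first >> t.last >> t.rank >> t.additional;
}

// The loop in main() then becomes:
//     table_t t;
//     while ( s >> t ) data.push_back(t);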
Source: https://stackoverflow.com/questions/5462022/tokenizing-a-string-of-data-into-a-vector-of-structs