Question
I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:
ifstream infile("myfile.txt");
string line;
while (true) {
if (!getline(infile, line)) break;
long linepos = infile.tellg();
process(line,linepos);
}
But this gives me only about 2 MB/s, even though a file manager copies the file at 100 MB/s. I guess that getline() is not buffering correctly. Please propose a buffered line-by-line reading approach.
Update: process() is not a bottleneck; the code without process() runs at the same speed.
Answer 1:
You won't get anywhere close to line speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I did experiments on datafiles composed of two ints and a double per line (Ivy Bridge chip, SSD):
- IO streams in various combinations: ~10 MB/s. Pure parsing (f >> i1 >> i2 >> d) is faster than a getline into a string followed by a stringstream parse.
- C file operations like fscanf get about 40 MB/s.
- getline with no parsing: 180 MB/s.
- fread: 500-800 MB/s (depending on whether or not the file was cached by the OS).
I/O is not the bottleneck, parsing is. In other words, your process is likely your slow point.
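To make the difference between the getline and fread cases above more concrete, here is a minimal sketch of reading a file with fread in large chunks and locating line ends with memchr. The name countLinesFread and the 1 MiB default chunk size are assumptions made for illustration; they are not part of the benchmark above, and no throughput figure is claimed for this code.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original answer): counts '\n' bytes in a
// file by reading it with fread in large chunks. Returns -1 if the file
// cannot be opened.
long long countLinesFread(const char* path, std::size_t chunkSize = 1 << 20) {
    std::FILE* f = std::fopen(path, "rb");  // binary mode: no newline translation
    if (!f) return -1;
    std::vector<char> buf(chunkSize);
    long long lines = 0;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        const char* p = buf.data();
        const char* end = p + n;
        // memchr scans the remaining bytes of the chunk for the next newline.
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', static_cast<std::size_t>(end - p)))) != nullptr) {
            ++lines;
            ++p;  // continue after the newline just found
        }
    }
    std::fclose(f);
    return lines;
}
```

Calling countLinesFread("myfile.txt") on the file from the question would return the number of newline characters in it. No per-line string is built here, so this only shows the raw read-and-scan part, not any parsing.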
So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):
- fread large chunks (one such task at a time)
- re-arrange chunks such that a line is not split between chunks (one such task at a time)
- parse chunk (many such tasks)
I can have unlimited parsing tasks because my data is unordered anyway. If yours isn't, then this might not be worth it to you. This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
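The re-arranging step in the task list above can be sketched as follows. This is not the author's TBB code, which is not shown in the answer; alignChunk and its carry parameter are hypothetical names, and the sketch covers only the serial step that keeps a line from being split between chunks.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper for the "re-arrange chunks" stage: prepends the partial
// line left over from the previous chunk, then cuts the result right after its
// last '\n'. Whatever follows that newline is stored back in `carry` for the
// next call. The returned string holds only whole lines (or is empty).
std::string alignChunk(std::string& carry, const std::string& chunk) {
    std::string data = carry + chunk;
    std::size_t lastNl = data.rfind('\n');
    if (lastNl == std::string::npos) {
        // No complete line yet: keep everything for the next chunk.
        carry = std::move(data);
        return std::string();
    }
    carry = data.substr(lastNl + 1);  // unfinished tail of the last line
    data.resize(lastNl + 1);          // keep whole lines, newline included
    return data;
}
```

Each returned string could then be handed to a separate parsing task. After the last chunk, whatever remains in carry is a final line without a trailing newline and still has to be parsed.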
Answer 2:
I translated my own buffering code from my Java project, and it does what I need. I had to add defines to work around a problem with the MSVC 2010 compiler's tellg, which always gives wrong negative values on huge files. This algorithm gives the desired speed of ~100 MB/s, though it does some useless new[] calls.
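#include <algorithm> // min
#include <cstdio>    // printf
#include <cstring>   // memmove
#include <fstream>
#include <string>
using namespace std;
#if !defined(_MSC_VER) && !defined(__int64)
typedef long long __int64; // __int64 is an MSVC type; alias it on other compilers
#endif
// Calls lineHandler once per line (newline included), then once with a null pointer at EOF.
void readFileFast(ifstream &file, void(*lineHandler)(char*str, int length, __int64 absPos)){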
int BUF_SIZE = 40000;
file.seekg(0,ios::end);
ifstream::pos_type p = file.tellg();
#ifdef WIN32
__int64 fileSize = *(__int64*)(((char*)&p) +8);
#else
__int64 fileSize = p;
#endif
file.seekg(0,ios::beg);
BUF_SIZE = (int)min((__int64)BUF_SIZE, fileSize); // both arguments must have the same type
char* buf = new char[BUF_SIZE];
int bufLength = BUF_SIZE;
file.read(buf, bufLength);
int strEnd = -1;
int strStart;
__int64 bufPosInFile = 0;
while (bufLength > 0) {
int i = strEnd + 1;
strStart = strEnd;
strEnd = -1;
for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
if (buf[i] == '\n') {
strEnd = i;
break;
}
}
if (strEnd == -1) { // scroll buffer
if (strStart == -1) {
lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
bufPosInFile += bufLength;
bufLength = (int)min((__int64)bufLength, fileSize - bufPosInFile);
delete[]buf;
buf = new char[bufLength];
file.read(buf, bufLength);
} else {
int movedLength = bufLength - strStart - 1;
memmove(buf,buf+strStart+1,movedLength);
bufPosInFile += strStart + 1;
int readSize = (int)min((__int64)(bufLength - movedLength), fileSize - bufPosInFile - movedLength);
if (readSize != 0)
file.read(buf + movedLength, readSize);
if (movedLength + readSize < bufLength) {
char *tmpbuf = new char[movedLength + readSize];
memmove(tmpbuf,buf,movedLength+readSize);
delete[]buf;
buf = tmpbuf;
bufLength = movedLength + readSize;
}
strEnd = -1;
}
} else {
lineHandler(buf+ strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
}
}
lineHandler(0, 0, 0);//eof
}
void lineHandler(char*buf, int l, __int64 pos){
if(buf==0) return;
string s = string(buf, l);
printf("%s", s.c_str()); // never pass file data as the format string
}
void loadFile(){
ifstream infile("file", ios::binary); // binary mode so byte counts match the file size
readFileFast(infile,lineHandler);
}
Answer 3:
Use an existing line parser or write your own. There is a sample on SourceForge at http://tclap.sourceforge.net/; put the input in a buffer if necessary.
Source: https://stackoverflow.com/questions/24851291/read-huge-text-file-line-by-line-in-c-with-buffering