How to read extremely long lines from a text file quickly and safely in C++?

Anonymous (unverified), posted on 2019-12-03 02:20:02

Question:

There is a large text file of 6.53 GiB. Each line is either a data line or a comment line. Comment lines are usually short (fewer than 80 characters), while a data line contains more than 2 million characters and its length varies.

Since each data line must be processed as a unit, is there a simple way to read lines safely and quickly in C++?

safe (safe for variable-length data lines): The solution should be as easy to use as std::getline(). Since the line length varies, I hope to avoid extra manual memory management.

fast: The solution should be as fast as readline() in Python 3.6.0, or even as fast as fgets() from stdio.h.

A pure C solution is welcome. The interface for further processing is available in both C and C++.


UPDATE 1: Thanks to a short but invaluable comment from Basile Starynkevitch, the perfect solution came up: POSIX getline(). Since further processing only involves converting characters to numbers and does not use many features of the string class, a char array is sufficient in this application.


UPDATE 2: Thanks to comments from Zulan and Galik, who both report comparable performance among std::getline(), fgets() and POSIX getline(), another possible solution is to use a better standard library implementation such as libstdc++. Moreover, there is a report claiming that the Visual C++ and libc++ implementations of std::getline are not well optimised.

Moving from libc++ to libstdc++ changes the results a lot. With libstdc++ 3.4.13 / Linux 2.6.32 on a different platform, POSIX getline(), std::getline() and fgets() show comparable performance. Initially, the code was run under the default settings of clang in Xcode 8.3.2 (8E2002), so libc++ was used.
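
Since the results depend so strongly on the standard library, it can help to confirm which one a given build actually uses. The following is a minimal sketch, not part of the original benchmarks; the function name stdlib_name is made up for illustration, and it relies on the version macros that libc++ (_LIBCPP_VERSION), libstdc++ (__GLIBCXX__) and recent Microsoft STL releases (_MSVC_STL_VERSION) define once any standard header is included.

```cpp
// which-stdlib.cpp
// Sketch: report which C++ standard library this file was compiled against.
#include <string>  // any standard header makes the library's version macros visible

// stdlib_name is a hypothetical helper name, not a library function.
// It returns a short label for the standard library implementation in use.
const char* stdlib_name() {
#if defined(_LIBCPP_VERSION)
  return "libc++";      // LLVM's library, the default for clang in Xcode
#elif defined(__GLIBCXX__)
  return "libstdc++";   // GNU's library, the usual default on Linux
#elif defined(_MSVC_STL_VERSION)
  return "MSVC STL";    // Microsoft's library
#else
  return "unknown";
#endif
}
```

Printing the result of stdlib_name() at the start of each benchmark makes it easier to compare numbers across platforms. With clang, the library can usually be selected with -stdlib=libc++ or -stdlib=libstdc++, provided the requested library is installed.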


More details and what I have tried (very long):

getline() of <string> can handle arbitrarily long lines but is a bit slow. Is there an alternative in C++ to readline() in Python?

    // benchmark on Mac OS X with libc++ and SSD:
    readline() of python                         ~550 MiB/s

    fgets() of stdio.h, -O0 / -O2               ~1100 MiB/s

    getline() of string, -O0                      ~27 MiB/s
    getline() of string, -O2                     ~150 MiB/s
    getline() of string + stack buffer, -O2      ~150 MiB/s

    getline() of ifstream, -O0 / -O2             ~240 MiB/s
    read() of ifstream, -O2                      ~340 MiB/s

    wc -l                                        ~670 MiB/s

    cat data.txt | ./read-cin-unsync              ~20 MiB/s

    getline() of stdio.h (POSIX.1-2008), -O0    ~1300 MiB/s

  • Speeds are rounded very roughly, only to show the order of magnitude, and every program was run several times to ensure that the values are representative.

  • '-O0 / -O2' means the speeds are very similar for both optimization levels.

  • The code is shown below.


readline() of python

    # readline.py

    import time
    import os

    t_start = time.perf_counter()

    fname = 'data.txt'
    fin = open(fname, 'rt')

    count = 0

    while True:
        line = fin.readline()
        length = len(line)
        if length == 0:     # EOF
            break
        if length > 80:     # data line
            count += 1

    fin.close()

    t_end = time.perf_counter()
    elapsed = t_end - t_start   # renamed so it does not shadow the time module

    fsize = os.path.getsize(fname)/1024/1024   # file size in MiB
    print("speed: %d MiB/s" %(fsize/elapsed))
    print("reads %d data lines" %count)

    # run as `python readline.py` with python 3.6.0

fgets() of stdio.h

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <string.h>

    int main(int argc, char* argv[]){
      clock_t t_start = clock();

      if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
      }

      FILE* fp = fopen(argv[1], "r");
      if(fp == NULL) {
        perror("Failed to open file");
        return EXIT_FAILURE;
      }

      // maximum length of lines, determined previously by python
      const int SIZE = 1024*1024*3;
      char line[SIZE];   // 3 MiB on the stack; must fit within the stack limit

      int count = 0;
      while(fgets(line, SIZE, fp) == line) {
        if(strlen(line) > 80) {
          count += 1;
        }
      }

      clock_t t_end = clock();
      fclose(fp);

      const double fsize = 6685;  // file size in MiB

      double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
      fprintf(stdout, "reads %d data lines\n", count);

      return EXIT_SUCCESS;
    }

getline() of <string>

    // readline-string-getline.cpp
    #include <string>
    #include <fstream>
    #include <iostream>
    #include <cstdio>    // fprintf, perror
    #include <ctime>
    #include <cstdlib>

    using namespace std;

    int main(int argc, char* argv[]) {
      clock_t t_start = clock();

      if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
      }

      // manually set the buffer on stack
      const int BUFFERSIZE = 1024*1024*3;   // stack on my platform is 8 MiB
      char buffer[BUFFERSIZE];
      ifstream fin;
      fin.rdbuf()->pubsetbuf(buffer, BUFFERSIZE);
      fin.open(argv[1]);

      // default buffer setting
      // ifstream fin(argv[1]);

      if(!fin) {
        perror("Failed to open file");
        return EXIT_FAILURE;
      }

      // maximum length of lines, determined previously by python
      const int SIZE = 1024*1024*3;
      string line;
      line.reserve(SIZE);

      int count = 0;
      while(getline(fin, line)) {
        if(line.size() > 80) {
          count += 1;
        }
      }

      clock_t t_end = clock();

      const double fsize = 6685;  // file size in MiB

      double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
      fprintf(stdout, "reads %d data lines\n", count);

      return EXIT_SUCCESS;
    }

getline() of ifstream

    // readline-ifstream-getline.cpp
    #include <fstream>
    #include <iostream>
    #include <cstdio>    // fprintf, perror
    #include <cstring>   // strlen
    #include <ctime>
    #include <cstdlib>

    using namespace std;

    int main(int argc, char* argv[]) {
      clock_t t_start = clock();

      if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
      }

      ifstream fin(argv[1]);
      if(!fin) {
        perror("Failed to open file");
        return EXIT_FAILURE;
      }

      // maximum length of lines, determined previously by python
      const int SIZE = 1024*1024*3;
      char line[SIZE];

      int count = 0;
      while(fin.getline(line, SIZE)) {
        if(strlen(line) > 80) {
          count += 1;
        }
      }

      clock_t t_end = clock();

      const double fsize = 6685;  // file size in MiB

      double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
      fprintf(stdout, "reads %d data lines\n", count);

      return EXIT_SUCCESS;
    }

read() of ifstream

    // seq-read-bin.cpp
    // sequentially read the file to see the speed upper bound of
    // ifstream

    #include <iostream>
    #include <fstream>
    #include <cstdio>    // fprintf
    #include <cstdlib>   // EXIT_SUCCESS, EXIT_FAILURE
    #include <ctime>

    using namespace std;


    int main(int argc, char* argv[]) {
      clock_t t_start = clock();

      if(argc != 2) {
        fprintf(stderr, "needs one input argument\n");
        return EXIT_FAILURE;
      }

      ifstream fin(argv[1], ios::binary);

      const int SIZE = 1024*1024*3;
      char str[SIZE];

      while(fin) {
        fin.read(str,SIZE);
      }

      clock_t t_end = clock();
      double time = (t_end-t_start) / (double)CLOCKS_PER_SEC;

      const double fsize = 6685;  // file size in MiB

      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

      return EXIT_SUCCESS;
    }

use cat, then read from cin with cin.sync_with_stdio(false)

    #include <iostream>
    #include <string>    // string, getline
    #include <cstdio>    // fprintf
    #include <ctime>
    #include <cstdlib>

    using namespace std;

    int main(void) {
      clock_t t_start = clock();

      string input_line;

      cin.sync_with_stdio(false);

      while(cin) {
        getline(cin, input_line);
      }

      double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;

      const double fsize = 6685;  // file size in MiB

      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));

      return EXIT_SUCCESS;
    }

POSIX getline()

    // readline-c-getline.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char *argv[]) {

      clock_t t_start = clock();

      char *line = NULL;   // getline() allocates and grows this buffer
      size_t len = 0;
      ssize_t nread;

      if (argc != 2) {
        // argv[0] is the program name; argv[1] may be NULL here
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        exit(EXIT_FAILURE);
      }

      FILE *stream = fopen(argv[1], "r");
      if (stream == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
      }

      int length = -1;
      int count = 0;
      while ((nread = getline(&line, &len, stream)) != -1) {
        if (nread > 80) {
          count += 1;
        }
      }

      free(line);
      fclose(stream);

      double time = (clock() - t_start) / (double)CLOCKS_PER_SEC;
      const double fsize = 6685;  // file size in MiB
      fprintf(stdout, "takes %.2f s\n", time);
      fprintf(stdout, "speed: %d MiB/s\n", (int)(fsize/time));
      fprintf(stdout, "reads %d data lines.\n", count);
      // fprintf(stdout, "length of MSA: %d\n", length-1);

      exit(EXIT_SUCCESS);
    }

Answer 1:

As I commented, on Linux and POSIX systems you could consider using getline(3). I guess that the following compiles both as C and as C++ (assuming you have some valid fopen-ed FILE *fil; ...)

    char* linbuf = NULL; /// or nullptr in C++
    size_t linsiz = 0;
    ssize_t linlen = 0;

    while((linlen=getline(&linbuf, &linsiz,fil))>=0) {
      // do something useful with linbuf; but no C++ exceptions
    }
    free(linbuf); linsiz=0;

I guess this might work in (or be easily adapted to) C++. But then beware of C++ exceptions: they should not pass through the while loop (or you should ensure that an appropriate destructor or catch does free(linbuf);).

Also, getline could fail (e.g. if it calls a failing malloc), and you might need to handle that failure sensibly.
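
The two caveats above (exceptions must not leak the buffer, and getline can fail) can be sketched with a small RAII wrapper. This is only an illustration under stated assumptions: the class name LineReader and its methods are hypothetical, the trailing newline is stripped for convenience, and a failure other than end of file is reported by throwing.

```cpp
// line_reader.cpp
// Sketch: exception-safe wrapper around POSIX getline(3).
// LineReader is a hypothetical name, not part of any library.
#include <stdio.h>      // POSIX getline(), FILE, fopen, fclose
#include <stdlib.h>     // free()
#include <sys/types.h>  // ssize_t
#include <cstddef>
#include <stdexcept>
#include <string>

class LineReader {
public:
  explicit LineReader(const char* path) : fp_(fopen(path, "r")) {
    if (fp_ == nullptr) {
      throw std::runtime_error(std::string("cannot open ") + path);
    }
  }
  // The destructor runs even when an exception unwinds the stack,
  // so the buffer that getline() allocated is always released.
  ~LineReader() {
    free(buf_);
    fclose(fp_);
  }
  LineReader(const LineReader&) = delete;
  LineReader& operator=(const LineReader&) = delete;

  // Reads the next line and strips a trailing '\n' if present.
  // Returns false at end of file; throws if getline() fails for another
  // reason (for example a failed allocation).
  bool next() {
    const ssize_t n = ::getline(&buf_, &cap_, fp_);
    if (n < 0) {
      len_ = 0;
      if (feof(fp_)) return false;
      throw std::runtime_error("getline failed");
    }
    len_ = static_cast<std::size_t>(n);
    if (len_ > 0 && buf_[len_ - 1] == '\n') buf_[--len_] = '\0';
    return true;
  }

  const char* data() const { return buf_; }   // valid until the next call to next()
  std::size_t size() const { return len_; }

private:
  FILE* fp_;
  char* buf_ = nullptr;   // owned; grown by getline() as needed
  std::size_t cap_ = 0;   // current capacity of buf_
  std::size_t len_ = 0;   // length of the current line
};
```

A caller would construct LineReader reader(path); and loop with while (reader.next()) { ... }, using reader.data() and reader.size() inside the loop. Because the buffer is reused between calls, no per-line allocation happens once it has grown to the longest line.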



Answer 2:

Well, the C standard library is a subset of the C++ standard library. From the n4296 draft of the C++ 2014 standard:

17.2 The C standard library [library.c]

The C++ standard library also makes available the facilities of the C standard library, suitably adjusted to ensure static type safety.

So, provided you explain in a comment that a performance bottleneck requires it, it is perfectly fine to use fgets in a C++ program. You should simply encapsulate it carefully in a utility class in order to preserve the high-level OO structure.
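
One possible shape for such a utility class is sketched below. The name FgetsLineReader, the method read_line and the 64 KiB chunk size are all assumptions made for illustration. The sketch reads each line in fixed-size chunks with fgets and appends them to a std::string, so lines of any length are handled; it assumes the file contains no embedded NUL characters, because strlen is used to find the end of each chunk.

```cpp
// fgets_line_reader.cpp
// Sketch: fgets() hidden behind a small C++ class.
// FgetsLineReader is a hypothetical name, not part of any library.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

class FgetsLineReader {
public:
  explicit FgetsLineReader(const char* path) : fp_(std::fopen(path, "r")) {
    if (fp_ == nullptr) {
      throw std::runtime_error(std::string("cannot open ") + path);
    }
  }
  ~FgetsLineReader() { std::fclose(fp_); }
  FgetsLineReader(const FgetsLineReader&) = delete;
  FgetsLineReader& operator=(const FgetsLineReader&) = delete;

  // Reads one line into `line`, without the trailing '\n'.
  // Returns false when the end of the file is reached and nothing was read.
  bool read_line(std::string& line) {
    line.clear();  // keeps the capacity, so the string is reused across calls
    char chunk[64 * 1024];  // chunk size chosen arbitrarily for this sketch
    while (std::fgets(chunk, sizeof chunk, fp_) != nullptr) {
      const std::size_t n = std::strlen(chunk);
      if (n > 0 && chunk[n - 1] == '\n') {
        line.append(chunk, n - 1);  // end of line reached
        return true;
      }
      line.append(chunk, n);  // line continues in the next chunk
    }
    if (std::ferror(fp_)) throw std::runtime_error("read error");
    return !line.empty();  // last line without a trailing newline
  }

private:
  std::FILE* fp_;
};
```

Typical use would be std::string line; while (reader.read_line(line)) { ... }. Whether this keeps the speed of a plain fgets loop depends on the cost of the extra copy into the string, which has not been measured here.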



Answer 3:

Yes, there's a faster way to read lines and create strings.

Query the file size, then load the file into a buffer. Then iterate over the buffer, replacing the newlines with NULs and storing a pointer to the start of the next line.

It will be quite a bit faster if, as is likely, your platform has a call to load a file into memory.
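
A minimal sketch of this approach in portable C++ follows. The names LoadedLines and load_lines are hypothetical. Instead of raw pointers it stores the offset of each line, so the result stays valid if the structure is copied; that is a small deviation from the description above. It also assumes the whole file fits in memory, which for a 6.53 GiB file means at least that much free RAM.

```cpp
// load_lines.cpp
// Sketch: read a whole file into one buffer and split it into lines in place.
// LoadedLines and load_lines are hypothetical names used for illustration.
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct LoadedLines {
  std::vector<char> buffer;          // file contents, newlines replaced by '\0'
  std::vector<std::size_t> offsets;  // start offset of each line within buffer

  // Returns line i as a NUL-terminated C string.
  const char* line(std::size_t i) const { return buffer.data() + offsets[i]; }
};

LoadedLines load_lines(const std::string& path) {
  // Open at the end so that tellg() reports the file size.
  std::ifstream fin(path, std::ios::binary | std::ios::ate);
  if (!fin) throw std::runtime_error("cannot open " + path);
  const std::streamoff size = fin.tellg();
  if (size < 0) throw std::runtime_error("cannot determine size of " + path);
  fin.seekg(0, std::ios::beg);

  const std::size_t n = static_cast<std::size_t>(size);
  LoadedLines out;
  out.buffer.resize(n + 1);  // one extra byte for a final '\0'
  if (n > 0 && !fin.read(out.buffer.data(), size)) {
    throw std::runtime_error("short read on " + path);
  }
  out.buffer[n] = '\0';

  // Replace each newline with '\0' and remember where each line starts.
  std::size_t start = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (out.buffer[i] == '\n') {
      out.buffer[i] = '\0';
      out.offsets.push_back(start);
      start = i + 1;
    }
  }
  if (start < n) out.offsets.push_back(start);  // last line without a newline
  return out;
}
```

After LoadedLines lines = load_lines("data.txt");, each line is available as lines.line(i) for i below lines.offsets.size(). A platform call that maps the file into memory could replace the read step, but that is not shown here.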


