How can I speed up line by line reading of an ASCII file? (C++)

夕颜 2020-12-30 08:02

Here's a bit of code that is a considerable bottleneck after doing some measuring:

//-----------------------------------------------------------------------         
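
Judging from the answers below, the snippet read whitespace-separated words from dictionary.txt into a std::unordered_set via operator>> and dict.insert; roughly this pattern (dict and wordListFile are names used in the answers, everything else is assumed):

    #include <fstream>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void loadDictionary()
    {
        std::ifstream wordListFile("dictionary.txt"); // name taken from an answer below
    
        std::string word;
        while (wordListFile >> word)   // operator>> reads one whitespace-separated token
        {
            dict.insert(word);         // the insert the answers refer to
        }
    }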


        
9 answers
  • 2020-12-30 08:33

    Reading the whole file into memory in one go and then operating on it there would probably be faster, as it avoids repeatedly going back to the disk to read another chunk.
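
    A minimal sketch of that approach, assuming the dictionary fits comfortably in memory (the file name and set type are taken from elsewhere on this page):

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void loadDictionaryAtOnce()
    {
        std::ifstream file("dictionary.txt", std::ios::binary);
    
        std::ostringstream contents;
        contents << file.rdbuf();                  // slurp the whole file in one go
    
        std::istringstream words(contents.str());  // then split it entirely in memory
        std::string word;
        while (words >> word)
            dict.insert(word);
    }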

    Is 0.25 s actually a problem? If you're not intending to load much larger files, is there any need to make it faster if it makes the code less readable?

  • 2020-12-30 08:33

    A proper implementation of the IO library would cache the data for you, avoiding excessive disk accesses and system calls. I recommend using a system-call-level tool (e.g. strace if you're on Linux) to check what actually happens with your IO.

    Obviously, dict.insert(xxx) could also be a nuisance if it doesn't allow O(1) insertion.
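
    If dict is the std::unordered_set<std::string> mentioned in another answer, reserving buckets up front keeps insertion at amortized O(1) by avoiding intermediate rehashes; a small sketch (the expected word count is an assumption):

    #include <cstddef>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void prepareDict(std::size_t expectedWords)  // e.g. ~240000, a figure quoted below
    {
        // Allocate enough buckets so that inserting expectedWords elements
        // never triggers a rehash of the whole table.
        dict.reserve(expectedWords);
    }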

  • 2020-12-30 08:34

    Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

    Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try to profile IO and set insertion separately.
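
    One simple way to separate the two is to read every word into a std::vector<std::string> first and only then build the set, timing each phase; a sketch using std::chrono (file name assumed):

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_set>
    #include <vector>
    
    int main()
    {
        using Clock = std::chrono::steady_clock;
    
        // Phase 1: pure I/O and tokenising into a vector.
        auto t0 = Clock::now();
        std::vector<std::string> words;
        std::ifstream file("dictionary.txt");
        std::string w;
        while (file >> w)
            words.push_back(w);
        auto t1 = Clock::now();
    
        // Phase 2: set insertion only.
        std::unordered_set<std::string> dict;
        for (const std::string& s : words)
            dict.insert(s);
        auto t2 = Clock::now();
    
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        std::cout << "read:   " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
                  << "insert: " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
    }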

  • 2020-12-30 08:35

    Normally, you can get better performance by increasing the buffer size.

    Right after building the ifstream, you can set its internal buffer using:

    char LocalBuffer[4096]; // user-supplied stream buffer
    
    std::ifstream wordListFile("dictionary.txt");
    
    wordListFile.rdbuf()->pubsetbuf(LocalBuffer, 4096); // hand the buffer to the underlying filebuf
    

    Note: rdbuf's result is guaranteed not to be null if the construction of the ifstream succeeded.

    Depending on the memory available, you are strongly encouraged to grow the buffer if possible in order to limit interaction with the HDD and the number of system calls.

    I've performed some simple measurements using a little benchmark of my own; you can find the code below (and I am interested in critiques):

    gcc 3.4.2 on SLES 10 (sp 3)
    C : 9.52725e+06
    C++: 1.11238e+07
    difference: 1.59655e+06

    Which gives a slowdown of a whopping 17%.

    This takes into account:

    • automatic memory management (no buffer overflow)
    • automatic resource management (no risk of forgetting to close the file)
    • handling of locale

    So, we can argue that streams are slow... but please don't throw in a random piece of code and complain that it's slow; optimization is hard work.


    Corresponding code, where benchmark is a little utility of my own which measures the time of a repeated execution (here launched for 50 iterations) using gettimeofday; a sketch of such a helper follows the code.

    #include <fstream>
    #include <iostream>
    #include <iomanip>
    
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>  // atoi
    #include <string>
    
    #include "benchmark.h"
    
    // C-style reader: fscanf into a fixed char buffer, one word at a time.
    struct CRead
    {
      CRead(char const* filename): _filename(filename) {}
    
      void operator()()
      {
        FILE* file = fopen(_filename, "r");
        if (!file) { return; }
    
        int count = 0;
        // note: %s with no width limit can overflow _buffer on pathological input
        while ( fscanf(file,"%s", _buffer) == 1 ) { ++count; }
    
        fclose(file);
      }
    
      char const* _filename;
      char _buffer[1024];
    };
    
    // C++-style reader: operator>> into std::string, with an enlarged stream buffer.
    struct CppRead
    {
      CppRead(char const* filename): _filename(filename), _buffer() {}
    
      enum { BufferSize = 16184 };
    
      void operator()()
      {
        std::ifstream file(_filename);
        file.rdbuf()->pubsetbuf(_buffer, BufferSize);
    
        int count = 0;
        std::string s;
        while ( file >> s ) { ++count; }
      }
    
      char const* _filename;
      char _buffer[BufferSize];
    };
    
    
    int main(int argc, char* argv[])
    {
      size_t iterations = 1;
      if (argc > 1) { iterations = atoi(argv[1]); }
    
      char const* filename = "largefile.txt";
    
      CRead cread(filename);
      CppRead cppread(filename);
    
      double ctime = benchmark(cread, iterations);
      double cpptime = benchmark(cppread, iterations);
    
      std::cout << "C  : " << ctime << "\n"
                   "C++: " << cpptime << "\n";
    
      return 0;
    }
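
    benchmark.h is not shown above; a minimal sketch of a compatible helper, assuming the signature double benchmark(Functor&, size_t) and a result in microseconds, built on gettimeofday as described:

    // benchmark.h (sketch): run a functor `iterations` times and return the
    // total elapsed wall-clock time measured with gettimeofday.
    #ifndef BENCHMARK_H
    #define BENCHMARK_H
    
    #include <cstddef>
    #include <sys/time.h>
    
    template <typename Functor>
    double benchmark(Functor& f, std::size_t iterations)
    {
        timeval start, stop;
        gettimeofday(&start, 0);
    
        for (std::size_t i = 0; i != iterations; ++i) { f(); }
    
        gettimeofday(&stop, 0);
    
        return (stop.tv_sec  - start.tv_sec)  * 1e6
             + (stop.tv_usec - start.tv_usec);  // microseconds
    }
    
    #endif // BENCHMARK_H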
    
  • 2020-12-30 08:35

    If you really want fast, ditch istream and std::string, and create a trivial class Read_Only_Text around const char* and size; then memory-map the file and insert into an unordered_set<Read_Only_Text> holding references to the embedded strings. It will mean you needlessly keep the whole 2 MB file mapped even though your number of unique keys may be much smaller, but it'll be very, very fast to populate. I know this is a pain, but I've done it several times for various tasks and the results are very good.
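
    A sketch of that approach on POSIX (Read_Only_Text is the name from this answer; the hashing, tokenising and file name are assumptions):

    #include <cctype>
    #include <cstddef>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <unordered_set>
    
    // Non-owning view into the mapped file: just a pointer and a length.
    struct Read_Only_Text
    {
        const char* data;
        std::size_t size;
    
        bool operator==(const Read_Only_Text& other) const
        {
            return size == other.size && std::memcmp(data, other.data, size) == 0;
        }
    };
    
    // FNV-1a, only to keep the sketch self-contained; any string hash will do.
    struct ReadOnlyTextHash
    {
        std::size_t operator()(const Read_Only_Text& t) const
        {
            unsigned long long h = 0xcbf29ce484222325ull;
            for (std::size_t i = 0; i != t.size; ++i)
            {
                h ^= static_cast<unsigned char>(t.data[i]);
                h *= 0x100000001b3ull;
            }
            return static_cast<std::size_t>(h);
        }
    };
    
    int main()
    {
        int fd = open("dictionary.txt", O_RDONLY);
        if (fd == -1) return 1;
    
        struct stat st;
        if (fstat(fd, &st) == -1) return 1;
    
        void* mapping = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping == MAP_FAILED) return 1;
        const char* base = static_cast<const char*>(mapping);
        const char* end  = base + st.st_size;
    
        // The set stores views into the mapping, so the mapping must outlive it.
        std::unordered_set<Read_Only_Text, ReadOnlyTextHash> dict;
    
        for (const char* p = base; p != end; )
        {
            while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p;
            const char* word = p;
            while (p != end && !std::isspace(static_cast<unsigned char>(*p))) ++p;
            if (p != word)
                dict.insert(Read_Only_Text{word, static_cast<std::size_t>(p - word)});
        }
    
        // ... use dict ...
    
        munmap(mapping, st.st_size);
        close(fd);
        return 0;
    }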

  • 2020-12-30 08:36

    Unfortunately, there's not much you can do to increase performance when using an fstream.

    You may be able to get a very slight speed improvement by reading in larger chunks of the file and then parsing out single words, but this depends on how your fstream implementation does buffering.

    The only way to get a big improvement is to use your OS's I/O functions. For example, on Windows, opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag may speed up reads, as may using asynchronous reads to grab data from disk while you parse it in parallel.
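
    A sketch of opening the file that way with the Win32 API (file name assumed, error handling trimmed):

    #include <windows.h>
    
    #include <string>
    
    // Read the whole file through a handle opened with FILE_FLAG_SEQUENTIAL_SCAN,
    // a hint to the cache manager that access will be strictly sequential.
    std::string readWholeFile(const char* path)
    {
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (file == INVALID_HANDLE_VALUE) return std::string();
    
        std::string contents;
        char buffer[64 * 1024];
        DWORD bytesRead = 0;
        while (ReadFile(file, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead != 0)
            contents.append(buffer, bytesRead);
    
        CloseHandle(file);
        return contents;
    }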
