How can I speed up line by line reading of an ASCII file? (C++)

夕颜 2020-12-30 08:02

Here's a bit of code that is a considerable bottleneck after doing some measuring:

//-----------------------------------------------------------------------         
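
Judging from the answers below, the snippet read whitespace-separated words from dictionary.txt into a std::unordered_set via operator>> and dict.insert; roughly this pattern (dict and wordListFile are names used in the answers, everything else is assumed):

    #include <fstream>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void loadDictionary()
    {
        std::ifstream wordListFile("dictionary.txt"); // name taken from an answer below
    
        std::string word;
        while (wordListFile >> word)   // operator>> reads one whitespace-separated token
        {
            dict.insert(word);         // the insert the answers refer to
        }
    }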


        
9 answers
  • 2020-12-30 08:33

    Reading the whole file into memory in one go and then operating on it there would probably be faster, as it avoids repeatedly going back to the disk to read another chunk.
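
    A minimal sketch of that approach, assuming the dictionary fits comfortably in memory (the file name and set type are taken from elsewhere on this page):

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void loadDictionaryAtOnce()
    {
        std::ifstream file("dictionary.txt", std::ios::binary);
    
        std::ostringstream contents;
        contents << file.rdbuf();                  // slurp the whole file in one go
    
        std::istringstream words(contents.str());  // then split it entirely in memory
        std::string word;
        while (words >> word)
            dict.insert(word);
    }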

    Is 0.25 s actually a problem? If you're not intending to load much larger files, is there any need to make it faster if it makes the code less readable?

  • 2020-12-30 08:33

    A proper implementation of the IO library would cache the data for you, avoiding excessive disk accesses and system calls. I recommend using a system-call-level tool (e.g. strace if you're on Linux) to check what actually happens with your IO.

    Obviously, dict.insert(xxx) could also be a nuisance if it doesn't allow O(1) insertion.
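
    If dict is the std::unordered_set<std::string> mentioned in another answer, reserving buckets up front keeps insertion at amortized O(1) by avoiding intermediate rehashes; a small sketch (the expected word count is an assumption):

    #include <cstddef>
    #include <string>
    #include <unordered_set>
    
    std::unordered_set<std::string> dict;
    
    void prepareDict(std::size_t expectedWords)  // e.g. ~240000, a figure quoted below
    {
        // Allocate enough buckets so that inserting expectedWords elements
        // never triggers a rehash of the whole table.
        dict.reserve(expectedWords);
    }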

  • 2020-12-30 08:34

    Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

    Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try to profile IO and set insertion separately.
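
    One simple way to separate the two is to read every word into a std::vector<std::string> first and only then build the set, timing each phase; a sketch using std::chrono (file name assumed):

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_set>
    #include <vector>
    
    int main()
    {
        using Clock = std::chrono::steady_clock;
    
        // Phase 1: pure I/O and tokenising into a vector.
        auto t0 = Clock::now();
        std::vector<std::string> words;
        std::ifstream file("dictionary.txt");
        std::string w;
        while (file >> w)
            words.push_back(w);
        auto t1 = Clock::now();
    
        // Phase 2: set insertion only.
        std::unordered_set<std::string> dict;
        for (const std::string& s : words)
            dict.insert(s);
        auto t2 = Clock::now();
    
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        std::cout << "read:   " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
                  << "insert: " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
    }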

  • 2020-12-30 08:35

    Normally, you can get better performance by increasing the buffer size.

    Right after building the ifstream, you can set its internal buffer using:

    char LocalBuffer[4096]; // user-supplied stream buffer
    
    std::ifstream wordListFile("dictionary.txt");
    
    wordListFile.rdbuf()->pubsetbuf(LocalBuffer, 4096); // hand the buffer to the underlying filebuf
    

    Note: rdbuf's result is guaranteed not to be null if the construction of the ifstream succeeded.

    Depending on the memory available, you are strongly encouraged to grow the buffer if possible in order to limit interaction with the HDD and the number of system calls.

    I've performed some simple measurements using a little benchmark of my own; you can find the code below (and I am interested in critiques):

    gcc 3.4.2 on SLES 10 (sp 3)
    C : 9.52725e+06
    C++: 1.11238e+07
    difference: 1.59655e+06

    Which gives a slowdown of a whopping 17%.

    This takes into account:

    • automatic memory management (no buffer overflow)
    • automatic resource management (no risk of forgetting to close the file)
    • handling of locale

    So, we can argue that streams are slow... but please don't throw in a random piece of code and complain that it's slow; optimization is hard work.


    Corresponding code, where benchmark is a little utility of my own which measures the time of a repeated execution (here launched for 50 iterations) using gettimeofday; a sketch of such a helper follows the code.

    #include <fstream>
    #include <iostream>
    #include <iomanip>
    
    #include <cmath>
    #include <cstdio>
    #include <cstdlib>  // atoi
    #include <string>
    
    #include "benchmark.h"
    
    // C-style reader: fscanf into a fixed char buffer, one word at a time.
    struct CRead
    {
      CRead(char const* filename): _filename(filename) {}
    
      void operator()()
      {
        FILE* file = fopen(_filename, "r");
        if (!file) { return; }
    
        int count = 0;
        // note: %s with no width limit can overflow _buffer on pathological input
        while ( fscanf(file,"%s", _buffer) == 1 ) { ++count; }
    
        fclose(file);
      }
    
      char const* _filename;
      char _buffer[1024];
    };
    
    // C++-style reader: operator>> into std::string, with an enlarged stream buffer.
    struct CppRead
    {
      CppRead(char const* filename): _filename(filename), _buffer() {}
    
      enum { BufferSize = 16184 };
    
      void operator()()
      {
        std::ifstream file(_filename);
        file.rdbuf()->pubsetbuf(_buffer, BufferSize);
    
        int count = 0;
        std::string s;
        while ( file >> s ) { ++count; }
      }
    
      char const* _filename;
      char _buffer[BufferSize];
    };
    
    
    int main(int argc, char* argv[])
    {
      size_t iterations = 1;
      if (argc > 1) { iterations = atoi(argv[1]); }
    
      char const* filename = "largefile.txt";
    
      CRead cread(filename);
      CppRead cppread(filename);
    
      double ctime = benchmark(cread, iterations);
      double cpptime = benchmark(cppread, iterations);
    
      std::cout << "C  : " << ctime << "\n"
                   "C++: " << cpptime << "\n";
    
      return 0;
    }
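
    benchmark.h is not shown above; a minimal sketch of a compatible helper, assuming the signature double benchmark(Functor&, size_t) and a result in microseconds, built on gettimeofday as described:

    // benchmark.h (sketch): run a functor `iterations` times and return the
    // total elapsed wall-clock time measured with gettimeofday.
    #ifndef BENCHMARK_H
    #define BENCHMARK_H
    
    #include <cstddef>
    #include <sys/time.h>
    
    template <typename Functor>
    double benchmark(Functor& f, std::size_t iterations)
    {
        timeval start, stop;
        gettimeofday(&start, 0);
    
        for (std::size_t i = 0; i != iterations; ++i) { f(); }
    
        gettimeofday(&stop, 0);
    
        return (stop.tv_sec  - start.tv_sec)  * 1e6
             + (stop.tv_usec - start.tv_usec);  // microseconds
    }
    
    #endif // BENCHMARK_H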
    
  • 2020-12-30 08:35

    If you really want fast, ditch istream and std::string, and create a trivial class Read_Only_Text around const char* and size; then memory-map the file and insert into an unordered_set<Read_Only_Text> holding references to the embedded strings. It will mean you needlessly keep the whole 2 MB file mapped even though your number of unique keys may be much smaller, but it'll be very, very fast to populate. I know this is a pain, but I've done it several times for various tasks and the results are very good.
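
    A sketch of that approach on POSIX (Read_Only_Text is the name from this answer; the hashing, tokenising and file name are assumptions):

    #include <cctype>
    #include <cstddef>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <unordered_set>
    
    // Non-owning view into the mapped file: just a pointer and a length.
    struct Read_Only_Text
    {
        const char* data;
        std::size_t size;
    
        bool operator==(const Read_Only_Text& other) const
        {
            return size == other.size && std::memcmp(data, other.data, size) == 0;
        }
    };
    
    // FNV-1a, only to keep the sketch self-contained; any string hash will do.
    struct ReadOnlyTextHash
    {
        std::size_t operator()(const Read_Only_Text& t) const
        {
            unsigned long long h = 0xcbf29ce484222325ull;
            for (std::size_t i = 0; i != t.size; ++i)
            {
                h ^= static_cast<unsigned char>(t.data[i]);
                h *= 0x100000001b3ull;
            }
            return static_cast<std::size_t>(h);
        }
    };
    
    int main()
    {
        int fd = open("dictionary.txt", O_RDONLY);
        if (fd == -1) return 1;
    
        struct stat st;
        if (fstat(fd, &st) == -1) return 1;
    
        void* mapping = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping == MAP_FAILED) return 1;
        const char* base = static_cast<const char*>(mapping);
        const char* end  = base + st.st_size;
    
        // The set stores views into the mapping, so the mapping must outlive it.
        std::unordered_set<Read_Only_Text, ReadOnlyTextHash> dict;
    
        for (const char* p = base; p != end; )
        {
            while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p;
            const char* word = p;
            while (p != end && !std::isspace(static_cast<unsigned char>(*p))) ++p;
            if (p != word)
                dict.insert(Read_Only_Text{word, static_cast<std::size_t>(p - word)});
        }
    
        // ... use dict ...
    
        munmap(mapping, st.st_size);
        close(fd);
        return 0;
    }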

  • 2020-12-30 08:36

    Unfortunately, there's not much you can do to increase performance when using an fstream.

    You may be able to get a very slight speed improvement by reading in larger chunks of the file and then parsing out single words, but this depends on how your fstream implementation does buffering.

    The only way to get a big improvement is to use your OS's I/O functions. For example, on Windows, opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag may speed up reads, as may using asynchronous reads to grab data from disk while you parse it in parallel.
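
    A sketch of opening the file that way with the Win32 API (file name assumed, error handling trimmed):

    #include <windows.h>
    
    #include <string>
    
    // Read the whole file through a handle opened with FILE_FLAG_SEQUENTIAL_SCAN,
    // a hint to the cache manager that access will be strictly sequential.
    std::string readWholeFile(const char* path)
    {
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (file == INVALID_HANDLE_VALUE) return std::string();
    
        std::string contents;
        char buffer[64 * 1024];
        DWORD bytesRead = 0;
        while (ReadFile(file, buffer, sizeof(buffer), &bytesRead, NULL) && bytesRead != 0)
            contents.append(buffer, bytesRead);
    
        CloseHandle(file);
        return contents;
    }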
