getline while reading a file vs reading whole file and then splitting based on newline character

孤街浪徒 2020-12-16 17:09

I want to process each line of a file that is on a hard disk. Is it better to load the file as a whole and then split it on the newline character (using boost), or is it better to read it line by line using getline()?

6 Answers
  • 2020-12-16 17:33

    If it's a small file on disk, it's probably more efficient to read the entire file and then parse it line by line, rather than reading one line at a time - the latter would take lots of disk accesses.
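
    For illustration, here is a minimal sketch of the read-everything-then-split approach; the file name "small.txt" is just an assumption for the example:

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    
    int main()
    {
        // Read the entire (small) file into one string in a single go.
        std::ifstream f("small.txt", std::ios::binary);
        std::ostringstream whole;
        whole << f.rdbuf();
    
        // Then split on newline characters.
        std::istringstream in(whole.str());
        std::vector<std::string> lines;
        for (std::string line; std::getline(in, line); )
            lines.push_back(line);
    
        std::cout << lines.size() << " lines read" << std::endl;
        return 0;
    }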

  • 2020-12-16 17:36

    getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.

    Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just scanning the input to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

    But the "load the whole file in one go" approach has several distinct problems:

    1. You don't start processing until you have read the whole file.
    2. You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?

    Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.

    Edit: So, I wrote a program to measure this, since I think it's quite interesting.

    And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each, by copying all the source files in a directory containing about a dozen different source files and then copying the resulting file over itself several times to "multiply" it, until the test took over 1.5 seconds to run - which is how long I think you need to run things to make sure the timing isn't too susceptible to a random "network packet came in" or some other outside influence taking time away from the process.

    I also decided to measure the system and user time used by the process.

    $ ./bigfile
    Lines=24812608
    Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
    Lines=24812608
    Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
    Lines=24812608
    Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
    $ ./bigfile
    Lines=24812608
    Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
    Lines=24812608
    Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
    Lines=24812608
    Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)
    

    Here are the three different functions that read the file. (There is also some code to measure the time and so on, but to keep this post a reasonable size I chose not to post all of it; a rough sketch of such a timing harness follows the functions below. I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here.)

    // Headers needed by the functions below:
    #include <iostream>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <cstdlib>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    
    using namespace std;
    
    void func_readwhole(const char *name)
    {
        string fullname = string("bigfile_") + name;
        ifstream f(fullname.c_str());
    
        if (!f) 
        {
            cerr << "could not open file for " << fullname << endl;
            exit(1);
        }
    
        f.seekg(0, ios::end);
        streampos size = f.tellg();
    
        f.seekg(0, ios::beg);
    
        char* buffer = new char[static_cast<size_t>(size)];
        f.read(buffer, size);
        if (f.gcount() != size)
        {
            cerr << "Read failed ...\n";
            exit(1);
        }
    
        stringstream ss;
        ss.rdbuf()->pubsetbuf(buffer, size);
    
        int lines = 0;
        string str;
        while(getline(ss, str))
        {
            lines++;
        }
    
        f.close();
    
    
        cout << "Lines=" << lines << endl;
    
        delete [] buffer;
    }
    
    void func_getline(const char *name)
    {
        string fullname = string("bigfile_") + name;
        ifstream f(fullname.c_str());
    
        if (!f) 
        {
            cerr << "could not open file for " << fullname << endl;
            exit(1);
        }
    
        string str;
        int lines = 0;
    
        while(getline(f, str))
        {
            lines++;
        }
    
        cout << "Lines=" << lines << endl;
    
        f.close();
    }
    
    void func_mmap(const char *name)
    {
        char *buffer;
    
        string fullname = string("bigfile_") + name;
        int f = open(fullname.c_str(), O_RDONLY);
        if (f < 0)
        {
            cerr << "could not open file for " << fullname << endl;
            exit(1);
        }
    
        off_t size = lseek(f, 0, SEEK_END);
    
        lseek(f, 0, SEEK_SET);
    
        // Map the whole file read-only; the kernel pages it in on demand.
        buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
        if (buffer == MAP_FAILED)
        {
            cerr << "mmap failed for " << fullname << endl;
            exit(1);
        }
    
        stringstream ss;
        ss.rdbuf()->pubsetbuf(buffer, size);
    
        int lines = 0;
        string str;
        while(getline(ss, str))
        {
            lines++;
        }
    
        munmap(buffer, size);
        close(f);
        cout << "Lines=" << lines << endl;
    }
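
    The timing code itself is not included in the post. Purely as a hedged sketch, the wall-clock and user/system times could be measured with something like the following; the helper names time_one and tv_to_sec are illustrative, not the author's:

    #include <iostream>
    #include <sys/time.h>
    #include <sys/resource.h>
    
    static double tv_to_sec(const timeval &tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }
    
    // Run one of the read functions and report wall-clock, user and system time.
    void time_one(const char *label, void (*func)(const char *), const char *name)
    {
        timeval wall_start, wall_end;
        rusage usage_start, usage_end;
    
        gettimeofday(&wall_start, NULL);
        getrusage(RUSAGE_SELF, &usage_start);
    
        func(name);
    
        getrusage(RUSAGE_SELF, &usage_end);
        gettimeofday(&wall_end, NULL);
    
        cout << "Wallclock time for " << label << " is "
             << tv_to_sec(wall_end) - tv_to_sec(wall_start)
             << " (user:" << tv_to_sec(usage_end.ru_utime) - tv_to_sec(usage_start.ru_utime)
             << " system: " << tv_to_sec(usage_end.ru_stime) - tv_to_sec(usage_start.ru_stime)
             << ")" << endl;
    }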
    
  • 2020-12-16 17:40

    It's better to fetch all the data if it can be accommodated in memory, because whenever you request I/O your program loses the processor and is put in a wait queue.


    However, if the file is big, then it's better to read only as much data at a time as the processing requires, because a bigger read operation takes much longer to complete than a small one, and the CPU's process-switching time is much smaller than the time needed to read the entire file. A sketch of such chunked reading is shown below.
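
    As a hedged illustration of that idea, here is a minimal sketch that reads the file in fixed-size chunks and hands each complete line to a callback; the file name, chunk size and the process_line function are assumptions made for the example:

    #include <fstream>
    #include <iostream>
    #include <string>
    
    // Hypothetical per-line handler; replace with the real processing.
    static void process_line(const std::string &line)
    {
        (void)line;
    }
    
    int main()
    {
        std::ifstream f("bigfile.txt", std::ios::binary);
        if (!f) { std::cerr << "open failed\n"; return 1; }
    
        const std::size_t chunk_size = 1 << 20;   // read 1 MB at a time (assumption)
        std::string carry;                        // partial line left over from the previous chunk
        std::string chunk(chunk_size, '\0');
    
        while (f.read(&chunk[0], chunk_size) || f.gcount() > 0)
        {
            carry.append(chunk, 0, static_cast<std::size_t>(f.gcount()));
    
            // Hand off every complete line found so far.
            std::size_t pos;
            while ((pos = carry.find('\n')) != std::string::npos)
            {
                process_line(carry.substr(0, pos));
                carry.erase(0, pos + 1);
            }
        }
        if (!carry.empty())
            process_line(carry);  // last line without a trailing newline
        return 0;
    }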

  • 2020-12-16 17:45

    I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
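
    A minimal sketch of that idiom (the function name read_lines is mine, not from the answer):

    #include <fstream>
    #include <string>
    #include <vector>
    
    // Read a whole file into a vector of lines, one string per line.
    std::vector<std::string> read_lines(const std::string &path)
    {
        std::ifstream f(path);
        std::vector<std::string> lines;
        for (std::string line; std::getline(f, line); )
            lines.push_back(line);
        return lines;
    }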

    However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting them internally as newlines are found.

    If you want to know specifically which method is faster and by how much, you'll have to profile your code.

  • 2020-12-16 17:52

    The fstreams are buffered reasonably. The underlying accesses to the hard disk by the OS are buffered reasonably. The hard disk itself has a reasonable buffer. You almost certainly will not trigger more hard disk accesses if you read the file line by line. Or character by character, for that matter.

    So there is no reason to load the whole file into a big buffer and work on that buffer, because it already is in a buffer. And there often is no reason to buffer one line at a time, either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly; don't bother tossing everything around twice or more from one buffer to the next - unless doing so improves readability, or your profiler has told you that disk access is slowing your program down significantly.
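
    As a small, hedged illustration of working on the stream directly (the "work" here is just counting newlines, and the file name is an assumption):

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    
    int main()
    {
        std::ifstream f("bigfile.txt", std::ios::binary);
    
        // Count newlines straight from the stream buffer, without copying
        // the data into an intermediate buffer or string of our own.
        auto lines = std::count(std::istreambuf_iterator<char>(f),
                                std::istreambuf_iterator<char>(), '\n');
    
        std::cout << "Lines=" << lines << std::endl;
        return 0;
    }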

  • 2020-12-16 17:55

    The OS will read a whole block of data (depending on how the disk is formatted, typically 4-8 KB at a time) and do some of the buffering for you. Let the OS take care of it, and read the data in the way that makes sense for your program.
