Efficiently reading a very large text file in C++

一个人想着一个人 提交于 2019-12-28 03:38:28

问题


I have a very large text file(45GB). Each line of the text file contains two space separated 64bit unsigned integers as shown below.

4624996948753406865 10214715013130414417

4305027007407867230 4569406367070518418

10817905656952544704 3697712211731468838 ... ...

I want to read the file and perform some operations on the numbers.

My Code in C++:

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to c++. When I run this code, I am facing these problems.

  1. Because of reading the file in bytes, sometimes the last line of a block corresponds to an unfinished line in the original file("4624996948753406865 10214" instead of the actual string "4624996948753406865 10214715013130414417" of the main file).

  2. This code runs very very slow. It takes around 6secs to run for one block operations in a 64bit Intel Core i7 920 system with 6GB of RAM. Is there any optimization techniques that I can use to improve the runtime?

  3. Is it necessary to include "\n" along with blank character in the boost split function?

I have read about mmap files in C++ but I am not sure whether it's the correct way to do so. If yes, please attach some links.


回答1:


I'd redesign this to act streaming, instead of on a block.

A simpler approach would be:

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values are expected, using std::vector::reserve up front could speed it up further.


Alternatively you can use a memory mapped file and iterate over the character sequence.

  • How to parse space-separated floats in C++ quickly? shows these approaches with benchmarks for floats.

Update I modified the above program to parse uint32_ts into a vector.

When using a sample input file of 4.5GiB[1] the program runs in 9 seconds[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size():   402653184
0:08.96 elapsed, 6 context switches

Of course it allocates at least 402653184 * 4 * byte = 1.5 gibibytes. So when you read a 45 GB file, you will need an estimated 15GiB of RAM to just store the vector (assuming no fragmentation on reallocation): The 45GiB parse completes in 10min 45s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size():   4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling though). wc is blazingly fast

Full Code Used For Benchmark

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc<2) return 255;
    data_t data;
    data.reserve(4392580288);   // for the  45 GiB file benchmark
    // data.reserve(402653284); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)   
        std::cout << "Parse success\n";
    else 
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f) << " around '" << std::string(f,f+50) << "'\n";

    if (f!=l) 
        std::cerr << "Trailing unparsed: " << std::distance(f,l) << " characters\n";

    std::cout << "Data.size():   " << data.size() << "\n";
    std::cout << "Time taken by parsing: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time-start_parse).count() / 1000.0 << "s\n";
}

[1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

[2] obviously, this was with the file cached in the buffer cache on linux - the large file doesn't have this benefit




回答2:


I can only guess that the bottleneck is in:

string str(memblock);

-Because you allocate a 45MB long segment in memory.

You should read the file line by line, as described in here:

  • Read file line by line

In order to profile your program, you can print clock() between each line, as described in:

  • Easily measure elapsed time



回答3:


You can memory map the file into memory, but that would be platform dependent (on unix that would be mmap on windows CreateFileMapping/MapViewIntoFile); still if in 32 bit system you may have problems if there is not a big enough virtual memory block left (64 bit systems would not have that problem).

Memory mapping is supposed to be faster than reading the data block by block from disk.




回答4:


On Linux, using C <stdio.h> instead of C++ streams might help performance (because C++ streams are built above FILE-s). You could use readline(3) or fgets(3) or fscanf(3). You might set a larger buffer (e.g. 64Kbytes or 256Kbytes) using setbuffer(3) etc... But I guess your (improved) program would be I/O bound, not CPU bound. Then you could play with posix_fadvise(2)

You might consider using memory mapping mmap(2) & madvise(2) (see also m mode for fopen(3)). See also readahead(2)

At last, if your algorithm permits it, you might csplit the files in smaller pieces and process each of them in parallel processes.



来源:https://stackoverflow.com/questions/26736742/efficiently-reading-a-very-large-text-file-in-c

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!