I have a very large text file (45GB). Each line of the text file contains two space-separated 64-bit unsigned integers, as shown below.
4624996948753406865 10214715013
You can memory-map the file, but that is platform dependent (on Unix that would be mmap, on Windows CreateFileMapping/MapViewOfFile). On a 32-bit system you may still have problems if there is no large enough virtual memory block left; 64-bit systems do not have that problem.
Memory mapping is supposed to be faster than reading the data block by block from disk.
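For the Unix side, a minimal sketch of the mmap approach could look like this (POSIX only; the file name is a placeholder and error handling is kept to a minimum):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("input.txt", O_RDONLY);          // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    // Map the whole file read-only; the kernel pages it in on demand.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* base = static_cast<const char*>(p);
    // ... parse the [base, base + st.st_size) character range here ...
    munmap(p, st.st_size);
    close(fd);
}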
I can only guess that the bottleneck is in:
string str(memblock);
because you allocate a segment as long as the whole file in memory.
You should read the file line by line instead. To see where the time really goes, profile your program, e.g. by printing clock() between the steps.
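A minimal sketch of both suggestions combined (the file name is a placeholder):

#include <cstdio>
#include <ctime>
#include <fstream>
#include <string>

int main() {
    std::ifstream in("input.txt");                 // placeholder file name
    std::string line;
    std::clock_t t0 = std::clock();
    while (std::getline(in, line)) {
        // ... parse the two integers out of `line` here ...
    }
    std::clock_t t1 = std::clock();
    std::printf("reading took %.2fs of CPU time\n",
                double(t1 - t0) / CLOCKS_PER_SEC);
}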
I'd redesign this to work in streaming fashion, instead of reading the whole file into one block.
A simpler approach would be (this needs <fstream>, <vector>, <iterator>, and <cstdint>):
std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});
If you know roughly how many values are expected, using std::vector::reserve up front could speed it up further.
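For instance (the reserve count below is a made-up estimate, not derived from the actual file):

std::vector<uint64_t> parsed;
parsed.reserve(4000000000);    // hypothetical estimate of the value count
parsed.assign(std::istream_iterator<uint64_t>(ifs),
              std::istream_iterator<uint64_t>());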
Alternatively you can use a memory-mapped file and iterate over the character sequence.
Update: I modified the above program to parse uint32_ts into a vector.
When using a sample input file of 4.5GiB[1], the program runs in 9 seconds[2]:
sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size(): 402653184
0:08.96 elapsed, 6 context switches
Of course it allocates at least 402653184 * 4 bytes = 1.5 gibibytes. So when you read the 45 GB file, you will need an estimated 15 GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45 GiB parse completes in 10min 45s:
make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @ 0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size(): 4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches
By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though), and wc is blazingly fast.
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;
typedef std::vector<uint32_t> data_t;
using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc < 2) return 255;

    data_t data;
    data.reserve(4392580288); // for the 45 GiB file benchmark
    // data.reserve(402653184); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    // parse newline-separated integers straight out of the mapping, skipping blanks
    bool ok = phrase_parse(f, l, int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)
        std::cout << "Parse success\n";
    else
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f)
                  << " around '" << std::string(f, f + 50) << "'\n";

    if (f != l)
        std::cerr << "Trailing unparsed: " << std::distance(f, l) << " characters\n";

    std::cout << "Data.size(): " << data.size() << "\n";
    std::cout << "Time taken by parsing: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time - start_parse).count() / 1000.0
              << "s\n";
}
[1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt
[2] Obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit.
On Linux, using C <stdio.h> instead of C++ streams might help performance (because C++ streams are built on top of FILE-s). You could use getline(3), fgets(3), or fscanf(3). You might set a larger buffer (e.g. 64 KB or 256 KB) using setbuffer(3) etc. But I guess your (improved) program would be I/O bound, not CPU bound. A sketch follows this paragraph.
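A minimal sketch, using the portable setvbuf(3) in place of setbuffer(3); the file name and buffer size are placeholders:

#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
    std::FILE* f = std::fopen("input.txt", "r");   // placeholder file name
    if (!f) return 1;
    static char buf[256 * 1024];                   // 256 KB stdio buffer
    std::setvbuf(f, buf, _IOFBF, sizeof buf);
    uint64_t a, b;
    while (std::fscanf(f, "%" SCNu64 " %" SCNu64, &a, &b) == 2) {
        // ... use a and b ...
    }
    std::fclose(f);
}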
Then you could play with posix_fadvise(2). You might consider using memory mapping with mmap(2) & madvise(2) (see also the m mode for fopen(3)). See also readahead(2).
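For example, the sequential-read hint could be given like this (Linux/POSIX; the file name is a placeholder):

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY);          // placeholder file name
    if (fd < 0) return 1;
    // Tell the kernel we will read sequentially, so it can read ahead aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    // ... read from fd ...
    close(fd);
}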
Finally, if your algorithm permits it, you might csplit the file into smaller pieces and process each of them in parallel processes, e.g. along the lines sketched below.
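A rough shell sketch, using GNU split rather than csplit since fixed-size line-aligned chunks are what's wanted here; ./process is a hypothetical worker program:

split -n l/8 45gib_uint32s.txt chunk_    # 8 pieces, split on line boundaries (GNU split)
for c in chunk_*; do ./process "$c" & done
wait                                     # wait for all parallel workers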