Concatenate two huge files in C++

Submitted by 与世无争的帅哥 on 2021-02-06 08:59:25

Question


I have two text files, each written through a std::ofstream and each over a hundred megabytes in size, and I want to concatenate them. Using fstreams to hold the data while building a single file usually ends in an out-of-memory error because the data is too big.

Is there any way of merging them faster than O(n)?

File 1 (160MB):

0 1 3 5
7 9 11 13
...
...
9187653 9187655 9187657 9187659 

File 2 (120MB):

a b c d e f g h i j
a b c d e f g h j i
a b c d e f g i h j
a b c d e f g i j h
...
...
j i h g f e d c b a

Merged (280MB):

0 1 3 5
7 9 11 13
...
...
9187653 9187655 9187657 9187659 
a b c d e f g h i j
a b c d e f g h j i
a b c d e f g i h j
a b c d e f g i j h
...
...
j i h g f e d c b a

File generation:

std::ofstream a_file("file1.txt");
std::ofstream b_file("file2.txt");

while (/* whatever */) {
    a_file << num << std::endl;
}

while (/* whatever */) {
    b_file << character << std::endl;
}

// merge them here, doesn't matter if output is one of them or a new file
a_file.close();
b_file.close();

Answer 1:


Assuming you don't want to do any processing, and just want to concatenate two files to make a third, you can do this very simply by streaming the files' buffers:

std::ifstream if_a("a.txt", std::ios_base::binary);
std::ifstream if_b("b.txt", std::ios_base::binary);
std::ofstream of_c("c.txt", std::ios_base::binary);

of_c << if_a.rdbuf() << if_b.rdbuf();

I have tried this sort of thing with files of up to 100 MB in the past and had no problems. You effectively let C++ and the libraries handle any buffering that's required. It also means that you don't need to worry about file positions if your files get really big.

Alternatively, if you just want to copy b.txt onto the end of a.txt, open a.txt with the append flag and seek to the end (with std::ios_base::app every write already goes to the end of the file, so the seekp call is only a precaution):

std::ofstream of_a("a.txt", std::ios_base::binary | std::ios_base::app);
std::ifstream if_b("b.txt", std::ios_base::binary);

of_a.seekp(0, std::ios_base::end);
of_a << if_b.rdbuf();

These methods work by passing the std::streambuf of each input stream to the operator<< of the output stream, one of whose overloads takes a streambuf pointer. When there are no errors, that overload inserts the contents of the streambuf into the output stream, unformatted, until it reaches end of file.
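One detail of that overload is worth guarding against: if it inserts no characters at all (for example, because an input file is empty), it sets failbit on the output stream. The following is a minimal sketch of the same rdbuf approach with basic error checking. The function name `concatenate` and its bool return convention are assumptions for illustration, not part of the answer above, and it assumes the output path differs from both input paths.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the answer): writes the contents of a_path
// followed by b_path to out_path. Returns false if a file cannot be opened
// or if writing fails. Assumes out_path is not one of the inputs.
bool concatenate(const std::string& a_path,
                 const std::string& b_path,
                 const std::string& out_path)
{
    std::ifstream if_a(a_path, std::ios_base::binary);
    std::ifstream if_b(b_path, std::ios_base::binary);
    if (!if_a || !if_b) return false;

    std::ofstream of_c(out_path, std::ios_base::binary);
    if (!of_c) return false;

    // operator<<(streambuf*) sets failbit when it inserts no characters,
    // so skip empty inputs instead of treating them as errors.
    if (if_a.peek() != std::ifstream::traits_type::eof()) of_c << if_a.rdbuf();
    if (if_b.peek() != std::ifstream::traits_type::eof()) of_c << if_b.rdbuf();

    of_c.flush();
    return static_cast<bool>(of_c);
}
```

A call such as concatenate("a.txt", "b.txt", "c.txt") would then report failure through its return value rather than silently leaving a partial c.txt.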




Answer 2:


Is there any way of merging them faster than O(n)?

That would mean processing the data without passing over it even once. You cannot merge the files without reading them at least once, so the short answer is no.

For reading the data, you should consider unformatted block reads (look at std::istream::read, which std::fstream inherits).




Answer 3:


On Windows:-

system ("copy /b File1+File2 OutputFile");

On Linux:-

system ("cat File1 File2 > OutputFile");

But the answer is simple - don't read the whole file into memory! Read the input files in small blocks:-

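#include <fstream>
#include <vector>

// Copy input_file to output_file one block at a time.
void Cat (std::ifstream &input_file, std::ofstream &output_file)
{
  std::vector<char> buffer (64 * 1024); // block size is an arbitrary choice
  while (input_file)
  {
    // read() may deliver a short final block; gcount() reports its size.
    input_file.read (buffer.data (), buffer.size ());
    output_file.write (buffer.data (), input_file.gcount ());
  }
}

int main ()
{
  std::ofstream output_file ("OutputFile", std::ios::binary);

  std::ifstream input_file1 ("File1", std::ios::binary);
  Cat (input_file1, output_file);
  input_file1.close ();

  std::ifstream input_file2 ("File2", std::ios::binary);
  Cat (input_file2, output_file);
  input_file2.close ();
}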



Answer 4:


It really depends on whether you wish to use "pure" C++ for this. Personally, at the cost of portability, I would be tempted to write:

#include <cstdlib>
#include <sstream>

int main(int argc, char* argv[]) {
    std::ostringstream command;

    command << "cat "; // Linux Only, command for Windows is slightly different

    for (int i = 2; i < argc; ++i) { command << argv[i] << " "; }

    command << "> ";

    command << argv[1];

    return system(command.str().c_str());
}

Is it good C++ code? No, not really (it is non-portable and does not escape command arguments).

But it'll get you way ahead of where you are standing now.

As for a "real" C++ solution, with all the ugliness that streams could manage...

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

static std::size_t const BufferSize = 8192; // 8 KB

void appendFile(std::string const& outFile, std::string const& inFile) {
    std::ofstream out(outFile, std::ios_base::app |
                               std::ios_base::binary |
                               std::ios_base::out);

    std::ifstream in(inFile, std::ios_base::binary |
                             std::ios_base::in);

    std::vector<char> buffer(BufferSize);
    while (in.read(&buffer[0], buffer.size())) {
        out.write(&buffer[0], buffer.size());
    }

    // Fails when "read" encounters EOF,
    // but potentially still writes *some* bytes to buffer!
    out.write(&buffer[0], in.gcount());
}
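The function above reopens the output file in append mode on every call. As a rough sketch of how the same block-copy loop could handle a whole list of inputs, in the spirit of the cat-style command earlier in this answer, the variant below opens the output once and reports how many bytes it copied. The name `concatFiles`, the byte-count return value, and the choice to truncate the output and skip unopenable inputs are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes every file in `inputs`, in order, to `outFile`
// (truncating it first) and returns the number of bytes successfully written.
// Inputs that cannot be opened are skipped.
std::size_t concatFiles(std::string const& outFile,
                        std::vector<std::string> const& inputs) {
    std::ofstream out(outFile, std::ios_base::binary | std::ios_base::trunc);
    std::vector<char> buffer(8192); // same 8 KB block size as above
    std::size_t total = 0;

    for (std::string const& inFile : inputs) {
        std::ifstream in(inFile, std::ios_base::binary);
        while (in && out) {
            in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
            std::streamsize const got = in.gcount(); // short on the last block
            if (out.write(buffer.data(), got)) {
                total += static_cast<std::size_t>(got);
            }
        }
    }
    return total;
}
```

For example, concatFiles("c.txt", {"a.txt", "b.txt"}) would write a.txt followed by b.txt into c.txt; comparing the returned count with the expected total size is one simple way to notice a skipped or truncated input.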


Source: https://stackoverflow.com/questions/19564450/concatenate-two-huge-files-in-c
