Question
I have a large set of data chunks (~50GB). In my code I have to be able to do the following things:
1. Repeatedly iterate over all chunks and do some computations on them.
2. Repeatedly iterate over all chunks and do some computations on them, where in each iteration the order of the visited chunks is (as far as possible) randomized.
So far, I have split the data into 10 binary files (created with boost::serialization) and repeatedly read one after the other and perform the computations. For (2), I read the 10 files in random order and process each one in sequence, which is good enough.
However, reading one of the files (using boost::serialization) takes a long time and I'd like to speed it up.
Can I use memory-mapped files instead of boost::serialization?
In particular, I'd have a vector<Chunk*> in each file. I want to be able to read such a file very, very quickly.
How can I read/write such a vector<Chunk*> data structure? I have looked at boost::interprocess::file_mapping, but I'm not sure how to do it.
I read this (http://boost.cowic.de/rc/pdf/interprocess.pdf), but it doesn't say much about memory-mapped files. I think I'd store the vector<Chunk*> first in the mapped memory, then store the Chunks themselves. And vector<Chunk*> would actually become offset_ptr<Chunk>*, i.e., an array of offset_ptr?
Answer 1:
A memory-mapped file is a chunk of memory; like any other memory, it may be organized in bytes, little-endian words, bits, or any other data structure. If portability is a concern (e.g., endianness), some care is needed.
The following code may be a good starting point:
#include <cassert>
#include <cstddef>   // offsetof
#include <cstdint>
#include <iostream>
#include <boost/iostreams/device/mapped_file.hpp>

struct entry {
    std::uint32_t a;
    std::uint64_t b;
} __attribute__((packed)); /* compiler specific, but supported
                              in other ways by all major compilers */

// Verify the on-disk layout at compile time.
static_assert(sizeof(entry) == 12, "entry: Struct size mismatch");
static_assert(offsetof(entry, a) == 0, "entry: Invalid offset for a");
static_assert(offsetof(entry, b) == 4, "entry: Invalid offset for b");

int main(void) {
    boost::iostreams::mapped_file_source mmap("map");
    assert(mmap.is_open());
    // Treat the mapped bytes directly as an array of entry records.
    const entry* data_begin = reinterpret_cast<const entry*>(mmap.data());
    const entry* data_end   = data_begin + mmap.size() / sizeof(entry);
    for (const entry* ii = data_begin; ii != data_end; ++ii)
        std::cout << std::hex << ii->a << " " << ii->b << std::endl;
    return 0;
}
The data_begin and data_end pointers can be used with most STL algorithms like any other iterators.
Source: https://stackoverflow.com/questions/19531243/how-to-read-write-vectorchunk-as-memory-mapped-files