I am looking to optimize reading and writing huge amounts of data for a C++ simulation application. The data, termed a "map", essentially consists of integers, doubles, floats and a single
Maybe not relevant in this case, but I managed to improve performance in an application with heavy file reads and writes by writing compressed data (zlib) and decompressing it on the fly. The reduced read/write time outweighed the increased CPU load.
Alternatively, if your problem is that the amount of data does not fit in memory and you want to use the disk as a cache, you can look into memcached, which provides a scalable and distributed memory cache.
The effectiveness of this idea depends on your pattern of access, but if you are not looking at that variable size data each cycle, you might speed up access by rearranging your file structure:
Instead of writing a direct dump of a structure like this:
struct {
    int x;
    enum kind t;              /* some enum type */
    int sz;
    char variable_data[sz];   /* pseudocode: sz bytes stored inline after the fixed fields */
};
you could write all the fixed size parts up front, then store the variable portions afterward:
struct {
    int x;
    enum kind t;                     /* some enum type */
    int sz;
    long offset_to_variable_data;    /* where the sz bytes live, later in the file */
};
Now, as you parse the file each cycle, you can linearly read N records at a time. You will only have to deal with fseek when you need to fetch the variable-sized data. You might even consider keeping that variable portion in a separate file so that you also only read forward through that file.
This strategy may even improve your performance if you do go with a memory-mapped file as others suggested.
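To make the layout concrete, here is a minimal sketch of it in C++. The names `FixedRecord`, `write_records`, `read_fixed` and `read_variable` are made up for illustration, and the sketch assumes the record count is known to the reader (in practice you would probably store it in a small file header). It dumps the struct raw, so the file is only readable by a build with the same struct layout and endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Fixed-size part of each record; the variable part lives later in the file.
struct FixedRecord {
    std::int32_t x;
    std::int32_t t;       // enum value stored as an integer
    std::int32_t sz;      // length of the variable data in bytes
    std::int64_t offset;  // absolute file offset of the variable data
};

// Write all fixed-size records up front, then all variable-size payloads.
void write_records(const std::string& path,
                   const std::vector<std::int32_t>& xs,
                   const std::vector<std::string>& payloads) {
    assert(xs.size() == payloads.size());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    // The first payload starts right after the block of fixed records.
    std::int64_t next = static_cast<std::int64_t>(xs.size() * sizeof(FixedRecord));
    for (std::size_t i = 0; i < xs.size(); ++i) {
        FixedRecord r{xs[i], 0, static_cast<std::int32_t>(payloads[i].size()), next};
        out.write(reinterpret_cast<const char*>(&r), sizeof r);
        next += r.sz;
    }
    for (const std::string& p : payloads)
        out.write(p.data(), static_cast<std::streamsize>(p.size()));
}

// Read `count` fixed-size records in one sequential pass, no seeking.
std::vector<FixedRecord> read_fixed(const std::string& path, std::size_t count) {
    std::vector<FixedRecord> recs(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(recs.data()),
            static_cast<std::streamsize>(count * sizeof(FixedRecord)));
    return recs;
}

// Seek only when the variable part of one record is actually needed.
std::string read_variable(const std::string& path, const FixedRecord& r) {
    std::string data(static_cast<std::size_t>(r.sz), '\0');
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(r.offset), std::ios::beg);
    in.read(&data[0], r.sz);
    return data;
}
```

A simulation loop would call `read_fixed` once per cycle and only call `read_variable` for the few records whose variable data it needs. Reopening the file on every `read_variable` call keeps the sketch short; real code would keep one stream open.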
Use a memory-mapped file (http://en.wikipedia.org/wiki/Memory-mapped_file).
"Millions" of maps does not sound like a lot of data. What prevents you from keeping all of it in memory?
Another option is to use a standard file format suited to your needs, e.g. sqlite (use SQL to store and retrieve data), a specialized format like hdf5, or your own format defined with something like Google Protocol Buffers.
Store the computed data in a relational database.
Since you do not mention an OS that you are running this on, have you looked at memory mapping the file and then using standard memory routines to "walk" the file as you go along?
This way you are not using fseek/fread; instead you are using pointer arithmetic. Here is an mmap example to copy one file from a source file to a destination file. This may improve the performance.
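For reference, a file copy through mmap can look roughly like the sketch below. This is my own minimal version for POSIX systems (Linux, macOS, BSD), not the example originally linked; the function name `mmap_copy` is made up, and Windows would need `CreateFileMapping`/`MapViewOfFile` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Copy src_path to dst_path by mapping both files and copying memory.
// POSIX only. Returns true on success.
bool mmap_copy(const char* src_path, const char* dst_path) {
    assert(src_path != nullptr && dst_path != nullptr);
    int src = open(src_path, O_RDONLY);
    if (src < 0) return false;
    struct stat st;
    if (fstat(src, &st) != 0) { close(src); return false; }
    int dst = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (dst < 0) { close(src); return false; }

    bool ok = true;
    if (st.st_size > 0) {  // mmap rejects zero-length mappings
        std::size_t len = static_cast<std::size_t>(st.st_size);
        // The destination must already have its final size before it is mapped.
        ok = ftruncate(dst, st.st_size) == 0;
        void* in = ok ? mmap(nullptr, len, PROT_READ, MAP_PRIVATE, src, 0)
                      : MAP_FAILED;
        void* out = ok ? mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0)
                       : MAP_FAILED;
        ok = in != MAP_FAILED && out != MAP_FAILED;
        if (ok) std::memcpy(out, in, len);  // plain memory copy, no fread/fwrite
        if (in != MAP_FAILED) munmap(in, len);
        if (out != MAP_FAILED) munmap(out, len);
    }
    close(src);
    close(dst);
    return ok;
}
```

For the simulation you would not copy anything; you would keep the `in` mapping around and cast offsets into it to your record types, which is where the pointer arithmetic comes in.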
Another thing you could look into is splitting the data into smaller files, using a hash value corresponding to the time unit to decide when to close the current file and open the next one as the simulation continues. That way you are dealing with smaller files, which the host OS can cache more aggressively.
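A rough sketch of the file-splitting idea, assuming time steps are non-negative integers and using plain integer division as the "hash" (the simplest mapping that keeps consecutive steps together). `chunk_name`, `ChunkedWriter`, the `prefix` argument and the `.bin` naming scheme are all made up for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <utility>

// Map a time step to the name of the file that holds it.
std::string chunk_name(const std::string& prefix, long step, long steps_per_file) {
    assert(step >= 0 && steps_per_file > 0);
    return prefix + "." + std::to_string(step / steps_per_file) + ".bin";
}

// Appends the data for each time step, closing the current file and opening
// the next one whenever the step maps to a different chunk.
class ChunkedWriter {
public:
    ChunkedWriter(std::string prefix, long steps_per_file)
        : prefix_(std::move(prefix)), steps_per_file_(steps_per_file) {}

    void write(long step, const std::string& bytes) {
        std::string name = chunk_name(prefix_, step, steps_per_file_);
        if (name != current_) {  // step belongs to a different file
            out_.close();
            out_.clear();        // closing a stream that was never opened sets failbit
            out_.open(name, std::ios::binary | std::ios::app);
            current_ = name;
        }
        out_.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    }

private:
    std::string prefix_;
    long steps_per_file_;
    std::string current_;  // name of the file currently open, empty if none
    std::ofstream out_;
};
```

With `ChunkedWriter w("sim", 1000);`, steps 0 to 999 go to `sim.0.bin`, steps 1000 to 1999 to `sim.1.bin`, and so on. A reader can use the same `chunk_name` function to find the file for any given step.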