I have written a converter that takes openstreetmap xml files and converts them to a binary runtime rendering format that is typically about 10% of the original size. Input file
You need to stream your output as well as your input. If your output format is not stream-oriented, consider doing a second pass. For example, if the output file starts with a checksum/size of the data, leave space for it on the first pass and seek back and write to that space later.
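A minimal sketch of that seek-back trick (the file name and the two 8-byte header fields are just assumptions for illustration, not your converter's actual format):

```cpp
#include <cstdint>
#include <fstream>

int main() {
    std::ofstream out("map.bin", std::ios::binary);

    // First pass: reserve space for a header we can't know yet (data size + checksum).
    std::uint64_t placeholder[2] = {0, 0};
    out.write(reinterpret_cast<const char*>(placeholder), sizeof placeholder);

    std::uint64_t bytesWritten = 0;
    std::uint64_t checksum = 0;
    // ... stream the converted records here, updating bytesWritten and checksum ...

    // "Second pass": seek back and patch the header in place.
    out.seekp(0);
    std::uint64_t header[2] = {bytesWritten, checksum};
    out.write(reinterpret_cast<const char*>(header), sizeof header);
}
```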
A good technique for this is to store some of the objects in files on disk, and read them back only when you actually need to use them.
This technique is used by many open-source programs, such as Doxygen, to stay scalable when a large amount of memory would otherwise be needed.
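A rough sketch of the idea, with an assumed fixed-size record layout (not anything Doxygen or OSM actually uses): append records to a spill file as you parse, keep only the index in memory, and seek back to load a record when it's needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>

// A fixed-size record, so record i lives at offset i * sizeof(NodeRecord).
struct NodeRecord {
    std::int64_t id;
    double lat;
    double lon;
};

// Append a record to the spill file and return its index for later lookup.
std::size_t spill(std::ofstream& file, const NodeRecord& rec) {
    std::size_t index = static_cast<std::size_t>(file.tellp()) / sizeof(NodeRecord);
    file.write(reinterpret_cast<const char*>(&rec), sizeof rec);
    return index;
}

// Read one record back only when it is actually needed.
NodeRecord load(std::ifstream& file, std::size_t index) {
    NodeRecord rec;
    file.seekg(static_cast<std::streamoff>(index * sizeof(NodeRecord)));
    file.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return rec;
}

int main() {
    std::ofstream outFile("nodes.spill", std::ios::binary);
    std::size_t idx = spill(outFile, {42, 51.5, -0.1});
    outFile.flush();

    std::ifstream inFile("nodes.spill", std::ios::binary);
    NodeRecord back = load(inFile, idx);
    return back.id == 42 ? 0 : 1;
}
```

Fixed-size records keep each lookup down to a single seek; variable-size records would need a separate index of offsets instead.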
This is an old question but, since I've recently done the same thing ....
There is no simple answer. In an ideal world you'd use a machine with a huge address space (i.e. 64-bit) and massive amounts of physical memory; a huge address space alone is not sufficient, or it'll just thrash. In that case, parse the XML file into a database and, with appropriate queries, pull out what you need. Quite likely this is what OSM itself does (I believe the whole world is about 330 GB).
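If you do go the database route, the usual shape of it is a streaming parse feeding bulk inserts into something like SQLite. A sketch of the loading half, where parseNextNode() is only a stand-in for whatever streaming XML parser you use and the table layout is my own assumption:

```cpp
#include <sqlite3.h>
#include <cstdint>

struct Node { std::int64_t id; double lat, lon; };

// Stub standing in for a streaming XML parser (expat, libxml2's xmlReader, ...);
// it should return false once the input is exhausted.
bool parseNextNode(Node& out) { (void)out; return false; }

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("osm.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS nodes(id INTEGER PRIMARY KEY, lat REAL, lon REAL);",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO nodes VALUES(?, ?, ?);", -1, &stmt, nullptr);

    // Wrapping millions of inserts in one transaction makes them vastly faster.
    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    Node n;
    while (parseNextNode(n)) {
        sqlite3_bind_int64(stmt, 1, n.id);
        sqlite3_bind_double(stmt, 2, n.lat);
        sqlite3_bind_double(stmt, 3, n.lon);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```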
In reality I'm still using XP 32bit for reasons of expediency.
It's a trade-off between space and speed. You can do pretty much anything in any amount of memory, provided you don't care how long it takes. Using STL structures you can parse anything you want, but you'll soon run out of memory. You can define your own allocators that swap to disk, but again, it'll be inefficient because the maps, vectors, sets etc. do not really know what you are doing.
The only way I found to make it all work in a small footprint on a 32-bit machine was to think very carefully about what I was doing, what was needed when, and how to break the task into chunks. The result is memory-efficient (never uses more than ~100 MB) but not massively quick; then again, it doesn't matter - how often does one have to parse the XML data?
Assuming you are using Windows XP, if you are only just over your memory limit and don't have the desire or the time to rework the code as suggested above, you can add the /3GB switch to your boot.ini file; then it's just a matter of setting a linker switch to get an extra 1 GB of memory.
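Concretely (your boot.ini entry will differ; this is just the stock XP Professional line with /3GB appended):

```
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect /3GB
```

and the linker switch in question is /LARGEADDRESSAWARE (Linker > System > Enable Large Addresses in Visual Studio):

```
link /LARGEADDRESSAWARE your_objects.obj
```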
You don't need to switch to 64-bit machines, nor do you need most of the 1000 things suggested by others. What you need is a more thoughtful algorithm.
Here are some things you can do to help out with this situation:
Finally, let me point out that complex tasks require complex measures. If you think you can afford a 64-bit machine with 8 GB of RAM, then just use the "read file into memory, process data, write output" algorithm, even if it takes a day to finish.
First, on a 32-bit system, you will always be limited to 4 GB of address space, no matter the pagefile settings. (And of that, only 2 GB will be available to your process on Windows; on Linux, you'll typically have around 3 GB available.)
So the first obvious solution is to switch to a 64-bit OS, and compile your application for 64-bit. That gives you a huge virtual memory space to use, and the OS will swap data in and out of the pagefile as necessary to keep things working.
Second, allocating smaller chunks of memory at a time may help. It's often easier to find four free 256 MB chunks of address space than one contiguous 1 GB chunk.
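For example, a chunked container that grows in fixed-size blocks only ever asks for modest allocations; the record type and chunk size below are purely illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct NodeRecord { std::int64_t id; double lat, lon; };

// Grows in many modest allocations instead of one huge contiguous block.
class ChunkedStore {
    static const std::size_t kChunkRecords = 1u << 20;  // ~24 MB per chunk at 24 bytes/record
    std::vector<std::unique_ptr<NodeRecord[]>> chunks_;
    std::size_t size_ = 0;
public:
    void push_back(const NodeRecord& rec) {
        if (size_ % kChunkRecords == 0)
            chunks_.emplace_back(new NodeRecord[kChunkRecords]);  // one modest allocation
        chunks_.back()[size_ % kChunkRecords] = rec;
        ++size_;
    }
    const NodeRecord& operator[](std::size_t i) const {
        return chunks_[i / kChunkRecords][i % kChunkRecords];
    }
    std::size_t size() const { return size_; }
};

int main() {
    ChunkedStore nodes;
    for (std::int64_t i = 0; i < 3000000; ++i)
        nodes.push_back({i, 0.0, 0.0});
    return nodes[0].id == 0 ? 0 : 1;
}
```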
Third, split up the problem. Don't process the entire dataset at once, but try to load and process only a small section at a time.
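A SAX-style streaming parser is the usual way to do that with OSM XML; here's a minimal sketch using Expat, where the input file name and the "emit the node" step are placeholders:

```cpp
#include <expat.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Called once per start tag; handle each element as it streams past and keep
// nothing in memory beyond the current element.
static void XMLCALL onStartElement(void* /*userData*/, const XML_Char* name,
                                   const XML_Char** atts) {
    if (std::strcmp(name, "node") != 0)
        return;
    long long id = 0;
    double lat = 0.0, lon = 0.0;
    for (int i = 0; atts[i]; i += 2) {
        if (std::strcmp(atts[i], "id") == 0)  id  = std::atoll(atts[i + 1]);
        if (std::strcmp(atts[i], "lat") == 0) lat = std::atof(atts[i + 1]);
        if (std::strcmp(atts[i], "lon") == 0) lon = std::atof(atts[i + 1]);
    }
    // ... write the node straight to the binary output / spill file here ...
    (void)id; (void)lat; (void)lon;
}

int main() {
    std::FILE* in = std::fopen("map.osm", "rb");
    if (!in) return 1;

    XML_Parser parser = XML_ParserCreate(nullptr);
    XML_SetElementHandler(parser, onStartElement, nullptr);

    // Feed the file through in 64 KB pieces; memory use stays flat no matter
    // how large the input file is.
    char buf[1 << 16];
    int done = 0;
    while (!done) {
        std::size_t n = std::fread(buf, 1, sizeof buf, in);
        done = std::feof(in) ? 1 : 0;
        if (XML_Parse(parser, buf, static_cast<int>(n), done) == XML_STATUS_ERROR)
            break;
    }

    XML_ParserFree(parser);
    std::fclose(in);
}
```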