I'm really puzzled why it keeps dying with java.lang.OutOfMemoryError during indexing even though it has a few GB of memory.
Is there a fundamental reason why it needs
I'm not certain there is a surefire way to ensure you won't run into an OutOfMemoryError with Lucene. The problem you are facing is related to the use of FieldCache, which, per the Lucene API documentation, "Maintains caches of term values." If the cached term values need more memory than the JVM heap provides, you'll get the error.
Your stack trace shows the documents being sorted at "org.apache.lucene.search.FieldComparator$StringOrdValComparator.setNextReader(FieldComparator.java:667)". Sorting on a string field takes up as much memory as is needed to store the terms being sorted for the index.
You'll need to review the projected size of the sortable fields and adjust the JVM heap settings (for example, -Xmx) accordingly.
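To get a rough feel for that projected size, here is a minimal back-of-the-envelope sketch. It assumes the sort cache holds roughly one int per document plus one String per unique term in the sorted field; the class name, the method, and the 40-byte per-String overhead are my own illustrative assumptions, not anything from Lucene, and real numbers vary by JVM and Lucene version.

```java
public class SortCacheEstimate {
    // Assumed per-String overhead in bytes (object headers, fields); a guess, varies by JVM.
    static final long ASSUMED_STRING_OVERHEAD_BYTES = 40L;

    // Rough estimate: one 4-byte int per document plus one String per unique term.
    static long estimateBytes(long maxDoc, long uniqueTerms, long avgTermChars) {
        long ordArray = 4L * maxDoc;
        // Java chars are 2 bytes each in the classic String representation.
        long termStrings = uniqueTerms * (ASSUMED_STRING_OVERHEAD_BYTES + 2L * avgTermChars);
        return ordArray + termStrings;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10M docs, 5M unique sort terms, 20 chars per term on average.
        long needed = estimateBytes(10_000_000L, 5_000_000L, 20L);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Estimated sort cache: " + needed / (1024 * 1024) + " MB");
        System.out.println("JVM max heap: " + maxHeap / (1024 * 1024) + " MB");
    }
}
```

If the estimate for one sortable field is already a large fraction of the max heap, either raise the heap or sort on a field with fewer or shorter unique terms.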