Let's say I have a table with millions of rows. Using JPA, what's the proper way to iterate over a query against that table, such that I don't have an in-memory List with all the objects?
To be honest, I would suggest leaving JPA and sticking with JDBC (but certainly using a support class such as JdbcTemplate). JPA (and other ORM providers/specifications) is not designed to operate on many objects within one transaction, as they assume everything loaded should stay in the first-level cache (hence the need for clear() in JPA).
Also, I am recommending a more low-level solution because the overhead of ORM (reflection is only the tip of the iceberg) might be so significant that iterating over a plain ResultSet, even with some lightweight support like the mentioned JdbcTemplate, will be much faster.
JPA is simply not designed to perform operations on a large number of entities. You might play with flush()/clear() to avoid OutOfMemoryError, but consider this once again: you gain very little while paying the price of huge resource consumption.
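For illustration, here is a minimal sketch of that lower-level approach using Spring's JdbcTemplate with a RowCallbackHandler, so each row is handled as it streams in rather than being materialized into a list (the table and column names here are made up for the example):

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class ModelScanner
{
    private final JdbcTemplate jdbcTemplate;

    public ModelScanner(DataSource dataSource)
    {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        // Ask the driver to stream rows in batches rather than buffering the
        // whole result set; the exact behavior is vendor-specific.
        this.jdbcTemplate.setFetchSize(500);
    }

    public void processAll()
    {
        // The callback is invoked once per row, so nothing accumulates in memory.
        jdbcTemplate.query("SELECT id, name FROM model", (RowCallbackHandler) rs -> {
            long id = rs.getLong("id");
            String name = rs.getString("name");
            // ... handle one row at a time ...
        });
    }
}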
There is no "proper" what to do this, this isn't what JPA or JDO or any other ORM is intended to do, straight JDBC will be your best alternative, as you can configure it to bring back a small number of rows at a time and flush them as they are used, that is why server side cursors exist.
ORM tools are not designed for bulk processing; they are designed to let you manipulate objects and attempt to make the RDBMS that the data is stored in as transparent as possible, and most fail at the transparent part at least to some degree. At this scale, there is no way to process hundreds of thousands of rows (objects), much less millions, with any ORM and have it execute in any reasonable amount of time, because of the object instantiation overhead, plain and simple.
Use the appropriate tool. Straight JDBC and stored procedures definitely have a place in 2011, especially for what they do better than these ORM frameworks.
Pulling a million of anything, even into a simple List<Integer>, is not going to be very efficient regardless of how you do it. The correct way to do what you are asking is a simple SELECT id FROM table with the cursor set to SERVER SIDE (vendor-dependent) and to FORWARD_ONLY READ_ONLY, and iterate over that.
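A minimal sketch of that kind of cursor loop in plain JDBC (the fetch size and streaming requirements are vendor-dependent assumptions; MySQL, for example, only streams with a fetch size of Integer.MIN_VALUE, and PostgreSQL only streams with auto-commit off):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public static void scanIds(Connection conn) throws SQLException
{
    conn.setAutoCommit(false); // some drivers (e.g. PostgreSQL) stream only without auto-commit
    try (Statement st = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY))
    {
        st.setFetchSize(1000); // vendor-dependent; MySQL needs Integer.MIN_VALUE to stream
        try (ResultSet rs = st.executeQuery("SELECT id FROM table"))
        {
            while (rs.next())
            {
                long id = rs.getLong(1);
                // ... process one id at a time; rows are fetched incrementally ...
            }
        }
    }
}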
If you are really pulling millions of ids to process by calling some web server with each one, you will have to do some concurrent processing as well for this to run in any reasonable amount of time. Pulling with a JDBC cursor and placing a few ids at a time in a ConcurrentLinkedQueue, with a small pool of threads (# CPU/Cores + 1) pulling from it and processing them, is the only way to complete your task on a machine with any "normal" amount of RAM, given you are already running out of memory.
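A rough sketch of that producer/consumer arrangement, assuming the ids arrive from a cursor loop like the one above (the callWebService method, queue bound, and timeout are placeholders; note that ConcurrentLinkedQueue.size() traverses the queue, so this crude back-pressure check is only sketch-quality):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class IdPipeline
{
    private final ConcurrentLinkedQueue<Long> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean producerDone = new AtomicBoolean(false);

    // Placeholder for the per-id web service call described above.
    private void callWebService(long id) { }

    public void run(Iterable<Long> idsFromCursor) throws InterruptedException
    {
        int workers = Runtime.getRuntime().availableProcessors() + 1; // # CPU/Cores + 1
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++)
        {
            pool.submit(() -> {
                while (true)
                {
                    Long id = queue.poll();
                    if (id == null)
                    {
                        if (producerDone.get())
                        {
                            id = queue.poll(); // one final drain after the producer finished
                            if (id == null) return;
                        }
                        else
                        {
                            Thread.yield(); // queue momentarily empty; try again
                            continue;
                        }
                    }
                    callWebService(id);
                }
            });
        }
        // Producer: feed ids from the JDBC cursor, keeping only a few in memory.
        for (Long id : idsFromCursor)
        {
            while (queue.size() > 1000) // crude back-pressure; size() is O(n) here
            {
                Thread.sleep(5);
            }
            queue.add(id);
        }
        producerDone.set(true);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}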
See this answer as well.
I have wondered this myself, and it seems to depend on your situation. I have written an Iterator to make it easy to swap out both approaches (findAll vs findEntries). I recommend you try both.
Long count = entityManager().createQuery("select count(o) from Model o", Long.class).getSingleResult();
ChunkIterator<Model> it1 = new ChunkIterator<Model>(count, 2) {
    @Override
    public Iterator<Model> getChunk(long index, long chunkSize) {
        // Do your setFirstResult and setMaxResults here and return an iterator, e.g.:
        return entityManager().createQuery("from Model m", Model.class)
                .setFirstResult((int) index)
                .setMaxResults((int) chunkSize)
                .getResultList()
                .iterator();
    }
};

Iterator<Model> it2 = entityManager().createQuery("from Model m", Model.class).getResultList().iterator();
public static abstract class ChunkIterator<T>
        extends AbstractIterator<T> implements Iterable<T> {
    // AbstractIterator comes from Google Collections (Guava).
    private Iterator<T> chunk;
    private Long count;
    private long index = 0;
    private long chunkSize = 100;

    public ChunkIterator(Long count, long chunkSize) {
        super();
        this.count = count;
        this.chunkSize = chunkSize;
    }

    public abstract Iterator<T> getChunk(long index, long chunkSize);

    @Override
    public Iterator<T> iterator() {
        return this;
    }

    @Override
    protected T computeNext() {
        if (count == 0) return endOfData();
        if (chunk != null && !chunk.hasNext() && index >= count)
            return endOfData();
        if (chunk == null || !chunk.hasNext()) {
            chunk = getChunk(index, chunkSize);
            index += chunkSize;
        }
        if (chunk == null || !chunk.hasNext())
            return endOfData();
        return chunk.next();
    }
}
I ended up not using my chunk iterator (so it might not be that well tested). By the way, you will need Google Collections (Guava) if you want to use it.
Page 537 of Java Persistence with Hibernate gives a solution using ScrollableResults, but alas it's only for Hibernate. So it seems that using setFirstResult/setMaxResults and manual iteration really is necessary. Here's my solution using JPA:
private List<Model> getAllModelsIterable(int offset, int max)
{
    return entityManager.createQuery("from Model m", Model.class)
            .setFirstResult(offset)
            .setMaxResults(max)
            .getResultList();
}
then, use it like this:
private void iterateAll()
{
    int offset = 0;
    List<Model> models;
    while ((models = getAllModelsIterable(offset, 100)).size() > 0)
    {
        entityManager.getTransaction().begin();
        for (Model model : models)
        {
            log.info("do something with model: " + model.getId());
        }
        entityManager.flush();
        entityManager.clear();
        entityManager.getTransaction().commit();
        offset += models.size();
    }
}
With Hibernate there are four different ways to achieve what you want. Each has design tradeoffs, limitations, and consequences. I suggest exploring each and deciding which is right for your situation.
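For example, one of those approaches is the ScrollableResults API mentioned earlier; here is a minimal sketch (unwrapping the Session from the EntityManager and clearing it every 100 rows are assumptions of this sketch, not the only way to do it):

import javax.persistence.EntityManager;
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

private void scrollAll(EntityManager entityManager)
{
    Session session = entityManager.unwrap(Session.class);
    ScrollableResults results = session.createQuery("from Model m")
            .setReadOnly(true)
            .setFetchSize(100)
            .scroll(ScrollMode.FORWARD_ONLY);
    int i = 0;
    while (results.next())
    {
        Model model = (Model) results.get(0);
        // ... process the entity ...
        if (++i % 100 == 0)
        {
            session.clear(); // keep the first-level cache from growing
        }
    }
    results.close();
}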
If you use EclipseLink, I'm using this method to get the result as an Iterable:
private static <T> Iterable<T> getResult(TypedQuery<T> query)
{
    // EclipseLink
    if (query instanceof JpaQuery) {
        JpaQuery<T> jQuery = (JpaQuery<T>) query;
        jQuery.setHint(QueryHints.RESULT_SET_TYPE, ResultSetType.ForwardOnly)
              .setHint(QueryHints.SCROLLABLE_CURSOR, true);

        final Cursor cursor = jQuery.getResultCursor();
        return new Iterable<T>()
        {
            @SuppressWarnings("unchecked")
            @Override
            public Iterator<T> iterator()
            {
                return cursor;
            }
        };
    }
    return query.getResultList();
}
Close method:
static void closeCursor(Iterable<?> list)
{
    Iterator<?> it = list.iterator();
    if (it instanceof Cursor)
    {
        ((Cursor) it).close();
    }
}
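A sketch of how these two helpers might be used together (the query string and entity name are assumed for illustration):

TypedQuery<Model> query = entityManager.createQuery("from Model m", Model.class);
Iterable<Model> models = getResult(query);
try
{
    for (Model model : models)
    {
        // process one entity at a time; the cursor fetches rows as needed
    }
}
finally
{
    closeCursor(models); // release the server-side cursor when done
}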