I have a very large table. It's currently in a MySQL database. I use Django.
I need to iterate over each element of the table to pre-compute some particular data.
The essential answer: use raw SQL with server-side cursors.
Sadly, as of Django 1.5.2 there is no formal way to create a server-side MySQL cursor (I'm not sure about other database engines), so I wrote some magic code to solve this problem.
For Django 1.5.2 and MySQLdb 1.2.4, the following code will work. It is also well commented.
Caution: This is not based on public APIs, so it will probably break in future Django versions.
# This script should be tested under a Django shell, e.g., ./manage.py shell
from types import MethodType
import MySQLdb.cursors
import MySQLdb.connections
from django.db import connection
from django.db.backends.util import CursorDebugWrapper
def close_sscursor(self):
    """An instance method which replaces the close() method of the old cursor.

    Closing the server-side cursor with the original close() method would be
    quite slow and memory-intensive if the large result set was not exhausted,
    because fetchall() is called internally to retrieve the remaining records.
    Notice that close() is also called when the cursor is garbage collected.

    This method closes the cursor more efficiently, but if the result set is
    not fully iterated, the next cursor created from the same connection
    won't work properly. You can avoid this by either (1) closing the
    connection before creating a new cursor, or (2) iterating the result set
    fully before closing the server-side cursor.
    """
    if isinstance(self, CursorDebugWrapper):
        self.cursor.cursor.connection = None
    else:
        # This is for a CursorWrapper object
        self.cursor.connection = None
def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
    """Get a server-side MySQL cursor."""
    if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
        raise NotImplementedError('Only MySQL engine is supported')
    cursor = connection.cursor()
    if isinstance(cursor, CursorDebugWrapper):
        # Get the real MySQLdb.connections.Connection object
        conn = cursor.cursor.cursor.connection
        # Replace the internal client-side cursor with a server-side cursor
        cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
    else:
        # This is for a CursorWrapper object
        conn = cursor.cursor.connection
        cursor.cursor = conn.cursor(cursorclass=cursorclass)
    # Replace the old close() method
    cursor.close = MethodType(close_sscursor, cursor)
    return cursor
# Get the server-side cursor
cursor = get_sscursor(connection)
# Run a query with a large result set. Notice that the memory consumption is low.
cursor.execute('SELECT * FROM million_record_table')
# Fetch a single row, fetchmany() rows or iterate it via "for row in cursor:"
cursor.fetchone()
# You can interrupt the iteration at any time. This calls the new close() method,
# so no warning is shown.
cursor.close()
# The connection must be closed to let new cursors work properly; see the
# comments in close_sscursor().
connection.close()
If you just need to iterate over the table itself without doing anything fancy, Django comes with a built-in iterator:
queryset.iterator()
This causes Django to clean up its own cache to reduce memory use. Note that for truly large tables, this may not be enough.
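A minimal usage sketch, assuming a model MyModel importable from myapp.models (both names are hypothetical):
from myapp.models import MyModel  # hypothetical app and model

# iterator() streams over the results without populating the queryset's
# result cache, so the model instances can be garbage-collected as you go.
for obj in MyModel.objects.all().iterator():
    print(obj.pk)  # stand-in for your per-row work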
If you are doing something more complex with each object, or you have a lot of data, you have to write your own. The following is a queryset iterator that splits the queryset into chunks and is not much slower than the basic iterator (it makes a linear number of database queries, as opposed to one, but only one query per 1,000 rows). This function pages by primary key, which is necessary for an efficient implementation, since OFFSET is a linear-time operation in most SQL databases.
def queryset_iterator(queryset, page_size=1000):
    if not queryset:
        return
    max_pk = queryset.order_by("-pk")[0].pk
    # Scale the page size up by the average density of primary keys in the queryset
    adjusted_page_size = int(page_size * max_pk / queryset.count())
    pages = int(max_pk / adjusted_page_size) + 1
    for page_num in range(pages):
        lower = page_num * adjusted_page_size
        # Page by the adjusted size so that no pk range is skipped
        page = queryset.filter(pk__gte=lower, pk__lt=lower + adjusted_page_size)
        for obj in page:
            yield obj
Usage looks like:
for obj in queryset_iterator(Model.objects.all()):
    # do stuff
This code makes three assumptions:
1. The primary key is an integer.
2. Primary keys start at (or near) zero, since the first page starts at pk 0.
3. Primary keys are reasonably evenly distributed. If this assumption is violated, adjusted_page_size can end up too large and you may get one or several massive pages as part of your iteration.
To give a sense of the overhead, I tested this on a Postgres table with 40,000 entries. The queryset_iterator adds about 80% to the iteration time vs. raw iteration (2.2 seconds vs. 1.2 seconds). That overhead does not vary substantially for page sizes between 200 and 10,000, though it starts going up below 200.
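If the distribution assumption does not hold for your table, a variant that pages on the actual primary-key values (keyset pagination) avoids the oversized pages. A rough sketch, not from the answer above, assuming only that pk is orderable:
def keyset_iterator(queryset, page_size=1000):
    """Iterate a queryset in pk order, paging on the last pk seen.

    Unlike queryset_iterator above, this does not assume integer,
    near-zero, or evenly distributed primary keys.
    """
    last_pk = None
    while True:
        page = queryset.order_by("pk")
        if last_pk is not None:
            page = page.filter(pk__gt=last_pk)
        page = list(page[:page_size])
        if not page:
            return
        for obj in page:
            yield obj
        last_pk = page[-1].pk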
There is another option available. It won't make the iteration faster (in fact it will probably slow it down), but it will use far less memory. Depending on your needs, this may be appropriate.
large_qs = MyModel.objects.all().values_list("id", flat=True)
for model_id in large_qs:
    model_object = MyModel.objects.get(id=model_id)
    # do whatever you need to do with the model here
Only the ids are loaded into memory, and the objects are retrieved and discarded as needed. Note the increased database load and slower runtime, both tradeoffs for the reduction in memory usage.
I've used this when running async scheduled tasks on worker instances, where it doesn't really matter if they are slow, but where trying to use way too much memory could crash the instance and therefore abort the process.
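If one query per object turns out to be too slow, a middle ground is to keep only the ids in memory and load the objects in modest batches. A rough sketch, with MyModel and the batch size as placeholders rather than anything from the answer above:
from myapp.models import MyModel  # hypothetical app and model

BATCH = 500  # objects fetched per query; tune for your memory budget

ids = list(MyModel.objects.values_list("id", flat=True))
for start in range(0, len(ids), BATCH):
    # in_bulk() returns a dict mapping id -> object for this slice of ids
    for model_object in MyModel.objects.in_bulk(ids[start:start + BATCH]).values():
        pass  # do whatever you need to do with the model here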