Memory-efficient (constant) and speed-optimized iteration over a large table in Django

感动是毒 2021-01-30 05:20

I have a very large table. It's currently in a MySQL database. I use Django.

I need to iterate over each element of the table to pre-compute some particular data.

3 Answers
  • 2021-01-30 05:43

    The essential answer: use raw SQL with server-side cursors.

    Sadly, as of Django 1.5.2 there is no official way to create a server-side MySQL cursor (I'm not sure about other database engines), so I wrote some magic code to solve this problem.

    With Django 1.5.2 and MySQLdb 1.2.4, the following code works; it is also well commented.

    Caution: This is not based on public APIs, so it will probably break in future Django versions.

    # This script should be tested under a Django shell, e.g., ./manage.py shell
    
    from types import MethodType
    
    import MySQLdb.cursors
    import MySQLdb.connections
    from django.db import connection
    from django.db.backends.util import CursorDebugWrapper
    
    
    def close_sscursor(self):
        """An instance method which replace close() method of the old cursor.
    
        Closing the server-side cursor with the original close() method will be
        quite slow and memory-intensive if the large result set was not exhausted,
        because fetchall() will be called internally to get the remaining records.
        Notice that the close() method is also called when the cursor is garbage 
        collected.
    
        This method is more efficient on closing the cursor, but if the result set
        is not fully iterated, the next cursor created from the same connection
        won't work properly. You can avoid this by either (1) close the connection 
        before creating a new cursor, (2) iterate the result set before closing 
        the server-side cursor.
        """
        if isinstance(self, CursorDebugWrapper):
            self.cursor.cursor.connection = None
        else:
            # This is for CursorWrapper object
            self.cursor.connection = None
    
    
    def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
        """Get a server-side MySQL cursor."""
        if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
            raise NotImplementedError('Only MySQL engine is supported')
        cursor = connection.cursor()
        if isinstance(cursor, CursorDebugWrapper):
            # Get the real MySQLdb.connections.Connection object
            conn = cursor.cursor.cursor.connection
            # Replace the internal client-side cursor with a server-side cursor
            cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
        else:
            # This is for CursorWrapper object
            conn = cursor.cursor.connection
            cursor.cursor = conn.cursor(cursorclass=cursorclass)
        # Replace the old close() method
        cursor.close = MethodType(close_sscursor, cursor)
        return cursor
    
    
    # Get the server-side cursor
    cursor = get_sscursor(connection)
    
    # Run a query with a large result set. Notice that the memory consumption is low.
    cursor.execute('SELECT * FROM million_record_table')
    
    # Fetch a single row, call fetchmany(), or iterate via "for row in cursor:"
    cursor.fetchone()
    
    # You can interrupt the iteration at any time. This calls the new close() method,
    # so no warning is shown.
    cursor.close()
    
    # The connection must be closed for new cursors to work properly; see the
    # docstring of close_sscursor().
    connection.close()
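
    As a usage sketch (not part of the original answer), the full result set can
    be streamed in bounded batches with the standard DB-API fetchmany() call;
    process() below is a hypothetical placeholder for your per-row work:

    cursor = get_sscursor(connection)
    cursor.execute('SELECT * FROM million_record_table')
    while True:
        # fetchmany() pulls a limited number of rows at a time from the
        # server-side cursor, so memory stays flat regardless of table size.
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            process(row)  # hypothetical per-row work
    cursor.close()
    connection.close()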
    
  • 2021-01-30 05:45

    Simple Answer

    If you just need to iterate over the table itself without doing anything fancy, Django comes with a builtin iterator:

    queryset.iterator()
    

    This causes Django to clean up its own cache to reduce memory use. Note that for truly large tables, this may not be enough.
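
    As a minimal sketch of typical usage (Entry and expensive_work() are
    hypothetical names, not part of the original answer; newer Django versions
    also accept a chunk_size argument to iterator()):

    for entry in Entry.objects.all().iterator():
        # Each row is streamed from the database and discarded after this
        # iteration instead of being cached on the queryset.
        expensive_work(entry)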


    Complex Answer

    If you are doing something more complex with each object, or have a lot of data, you have to write your own. The following is a queryset iterator that splits the queryset into chunks and is not much slower than the basic iterator (it issues a linear number of database queries instead of one, but only about one query per 1,000 rows). This function pages by primary key, which is necessary for an efficient implementation, since OFFSET is a linear-time operation in most SQL databases.

    def queryset_iterator(queryset, page_size=1000):
        if not queryset.exists():  # exists() avoids evaluating (and caching) the whole queryset
            return
        max_pk = queryset.order_by("-pk")[0].pk
        # Scale the page size up by the average density of primary keys in the queryset
        adjusted_page_size = int(page_size * max_pk / queryset.count())

        pages = int(max_pk / adjusted_page_size) + 1
        for page_num in range(pages):
            lower = page_num * adjusted_page_size
            # Filter on a primary-key range (keyset pagination) instead of using OFFSET
            page = queryset.filter(pk__gte=lower, pk__lt=lower + adjusted_page_size)
            for obj in page:
                yield obj
    

    Usage looks like:

    for obj in queryset_iterator(Model.objects.all()):
        # do stuff
    

    This code makes two assumptions:

    1. Your primary keys are integers (this will not work for UUID primary keys).
    2. The primary keys of the queryset are at least somewhat uniformly distributed. If this is not true, the adjusted_page_size can end up too large and you may get one or several massive pages as part of your iteration.

    To give a sense of the overhead, I tested this on a Postgres table with 40,000 entries. The queryset_iterator adds about 80% to the iteration time vs raw iteration (2.2 seconds vs 1.2 seconds). That overhead does not vary substantially for page sizes between 200 and 10,000, though it starts going up below 200.
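
    A minimal sketch of that kind of comparison (Entry is a hypothetical model
    name; the absolute numbers will depend on your data and database):

    import time

    start = time.perf_counter()
    for obj in queryset_iterator(Entry.objects.all()):
        pass  # no per-object work, so only the iteration overhead is measured
    print("chunked iteration:", time.perf_counter() - start)

    start = time.perf_counter()
    for obj in Entry.objects.all().iterator():
        pass
    print("basic iterator:", time.perf_counter() - start)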

  • 2021-01-30 05:46

    There is another option available. It wouldn't make the iteration faster (in fact, it would probably slow it down), but it would use far less memory. Depending on your needs, this may be appropriate.

    large_qs = MyModel.objects.all().values_list("id", flat=True)
    for model_id in large_qs:
        model_object = MyModel.objects.get(id=model_id)
        # do whatever you need to do with the model here
    

    Only the ids are loaded into memory, and the objects are retrieved and discarded as needed. Note the increased database load and slower runtime, both tradeoffs for the reduction in memory usage.

    I've used this when running async scheduled tasks on worker instances, where it doesn't really matter if they are slow, but where using far too much memory can crash the instance and therefore abort the process.
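
    If the one-query-per-object cost becomes a problem, a hedged variant of the
    same idea (not part of the original answer) is to stream only the ids and
    load the objects in batches with in_bulk(), trading a little memory for far
    fewer queries:

    from itertools import islice

    def iterate_in_batches(queryset, batch_size=500):
        # Stream only the ids, then load each batch of objects with one query.
        ids = queryset.values_list("id", flat=True).iterator()
        while True:
            batch = list(islice(ids, batch_size))
            if not batch:
                break
            for obj in queryset.model.objects.in_bulk(batch).values():
                yield obj

    for model_object in iterate_in_batches(MyModel.objects.all()):
        # do whatever you need to do with the model here
        pass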
