Memory-efficient (constant) and speed-optimized iteration over a large table in Django

感动是毒 2021-01-30 05:20

I have a very large table. It's currently in a MySQL database. I use Django.

I need to iterate over each element of the table to pre-compute some particular data.

3 Answers
  • 2021-01-30 05:43

    The essential answer: use raw SQL with server-side cursors.

    Sadly, as of Django 1.5.2 there is no official way to create a server-side MySQL cursor (I'm not sure about other database engines), so I wrote some magic code to solve this problem.

    With Django 1.5.2 and MySQLdb 1.2.4, the following code works; it is also well commented.

    Caution: This is not based on public APIs, so it will probably break in future Django versions.

    # This script should be tested under a Django shell, e.g., ./manage.py shell
    
    from types import MethodType
    
    import MySQLdb.cursors
    import MySQLdb.connections
    from django.db import connection
    from django.db.backends.util import CursorDebugWrapper
    
    
    def close_sscursor(self):
        """An instance method which replace close() method of the old cursor.
    
        Closing the server-side cursor with the original close() method will be
        quite slow and memory-intensive if the large result set was not exhausted,
        because fetchall() will be called internally to get the remaining records.
        Notice that the close() method is also called when the cursor is garbage 
        collected.
    
        This method is more efficient on closing the cursor, but if the result set
        is not fully iterated, the next cursor created from the same connection
        won't work properly. You can avoid this by either (1) close the connection 
        before creating a new cursor, (2) iterate the result set before closing 
        the server-side cursor.
        """
        if isinstance(self, CursorDebugWrapper):
            self.cursor.cursor.connection = None
        else:
            # This is for CursorWrapper object
            self.cursor.connection = None
    
    
    def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
        """Get a server-side MySQL cursor."""
        if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
            raise NotImplementedError('Only MySQL engine is supported')
        cursor = connection.cursor()
        if isinstance(cursor, CursorDebugWrapper):
            # Get the real MySQLdb.connections.Connection object
            conn = cursor.cursor.cursor.connection
            # Replace the internal client-side cursor with a server-side cursor
            cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
        else:
            # This is for CursorWrapper object
            conn = cursor.cursor.connection
            cursor.cursor = conn.cursor(cursorclass=cursorclass)
        # Replace the old close() method
        cursor.close = MethodType(close_sscursor, cursor)
        return cursor
    
    
    # Get the server-side cursor
    cursor = get_sscursor(connection)
    
    # Run a query with a large result set. Notice that the memory consumption is low.
    cursor.execute('SELECT * FROM million_record_table')
    
    # Fetch a single row, call fetchmany(), or iterate via "for row in cursor:"
    cursor.fetchone()
    
    # You can interrupt the iteration at any time. This calls the new close() method,
    # so no warning is shown.
    cursor.close()
    
    # The connection must be closed for new cursors to work properly; see the
    # docstring of close_sscursor().
    connection.close()
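
    As a usage sketch (not part of the original answer), the full result set can
    be streamed in bounded batches with the standard DB-API fetchmany() call;
    process() below is a hypothetical placeholder for your per-row work:

    cursor = get_sscursor(connection)
    cursor.execute('SELECT * FROM million_record_table')
    while True:
        # fetchmany() pulls a limited number of rows at a time from the
        # server-side cursor, so memory stays flat regardless of table size.
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            process(row)  # hypothetical per-row work
    cursor.close()
    connection.close()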
    
  • 2021-01-30 05:45

    Simple Answer

    If you just need to iterate over the table itself without doing anything fancy, Django comes with a builtin iterator:

    queryset.iterator()
    

    This causes Django to clean up its own cache to reduce memory use. Note that for truly large tables, this may not be enough.
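
    As a minimal sketch of typical usage (Entry and expensive_work() are
    hypothetical names, not part of the original answer; newer Django versions
    also accept a chunk_size argument to iterator()):

    for entry in Entry.objects.all().iterator():
        # Each row is streamed from the database and discarded after this
        # iteration instead of being cached on the queryset.
        expensive_work(entry)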


    Complex Answer

    If you are doing something more complex with each object, or have a lot of data, you have to write your own. The following is a queryset iterator that splits the queryset into chunks and is not much slower than the basic iterator (it issues a linear number of database queries instead of one, but only about one query per 1,000 rows). This function pages by primary key, which is necessary for an efficient implementation, since OFFSET is a linear-time operation in most SQL databases.

    def queryset_iterator(queryset, page_size=1000):
        if not queryset.exists():  # exists() avoids evaluating (and caching) the whole queryset
            return
        max_pk = queryset.order_by("-pk")[0].pk
        # Scale the page size up by the average density of primary keys in the queryset
        adjusted_page_size = int(page_size * max_pk / queryset.count())

        pages = int(max_pk / adjusted_page_size) + 1
        for page_num in range(pages):
            lower = page_num * adjusted_page_size
            # Filter on a primary-key range (keyset pagination) instead of using OFFSET
            page = queryset.filter(pk__gte=lower, pk__lt=lower + adjusted_page_size)
            for obj in page:
                yield obj
    

    Usage looks like:

    for obj in queryset_iterator(Model.objects.all()):
        # do stuff
    

    This code makes two assumptions:

    1. Your primary keys are integers (this will not work for UUID primary keys).
    2. The primary keys of the queryset are at least somewhat uniformly distributed. If this is not true, the adjusted_page_size can end up too large and you may get one or several massive pages as part of your iteration.

    To give a sense of the overhead, I tested this on a Postgres table with 40,000 entries. The queryset_iterator adds about 80% to the iteration time vs raw iteration (2.2 seconds vs 1.2 seconds). That overhead does not vary substantially for page sizes between 200 and 10,000, though it starts going up below 200.
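
    A minimal sketch of that kind of comparison (Entry is a hypothetical model
    name; the absolute numbers will depend on your data and database):

    import time

    start = time.perf_counter()
    for obj in queryset_iterator(Entry.objects.all()):
        pass  # no per-object work, so only the iteration overhead is measured
    print("chunked iteration:", time.perf_counter() - start)

    start = time.perf_counter()
    for obj in Entry.objects.all().iterator():
        pass
    print("basic iterator:", time.perf_counter() - start)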

  • 2021-01-30 05:46

    There is another option available. It wouldn't make the iteration faster (in fact, it would probably slow it down), but it would use far less memory. Depending on your needs, this may be appropriate.

    large_qs = MyModel.objects.all().values_list("id", flat=True)
    for model_id in large_qs:
        model_object = MyModel.objects.get(id=model_id)
        # do whatever you need to do with the model here
    

    Only the ids are loaded into memory, and the objects are retrieved and discarded as needed. Note the increased database load and slower runtime, both tradeoffs for the reduction in memory usage.

    I've used this when running async scheduled tasks on worker instances, where it doesn't really matter if they are slow, but where using far too much memory can crash the instance and therefore abort the process.
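
    If the one-query-per-object cost becomes a problem, a hedged variant of the
    same idea (not part of the original answer) is to stream only the ids and
    load the objects in batches with in_bulk(), trading a little memory for far
    fewer queries:

    from itertools import islice

    def iterate_in_batches(queryset, batch_size=500):
        # Stream only the ids, then load each batch of objects with one query.
        ids = queryset.values_list("id", flat=True).iterator()
        while True:
            batch = list(islice(ids, batch_size))
            if not batch:
                break
            for obj in queryset.model.objects.in_bulk(batch).values():
                yield obj

    for model_object in iterate_in_batches(MyModel.objects.all()):
        # do whatever you need to do with the model here
        pass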
