
Memory efficient Django Queryset Iterator

Author:
WoLpH
Posted:
March 3, 2010
Language:
Python
Version:
1.1
Score:
10 (after 10 ratings)

While checking up on some cronjobs at YouTellMe, we had problems with large cronjobs that took far too much memory. Django normally loads all objects into memory when iterating over a queryset (even with .iterator(), although in that case it's not Django holding them in memory but your database client), so I needed a solution that chunks the querysets so only a small subset is kept in memory at a time.

Example of how to use it:

my_queryset = queryset_iterator(MyItem.objects.all())
for item in my_queryset:
    item.do_something()

More info on my blog

import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in
    memory at the same time, while Django normally would load all rows
    into memory. Using the iterator() method only prevents Django from
    caching all the results.

    Note that this implementation does not support ordered querysets;
    the queryset is re-ordered by primary key.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        # free the chunk we just iterated over before fetching the next one
        gc.collect()
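The chunked loop itself is plain Python and can be exercised without a database. Below is a toy simulation where a sorted list of integers stands in for the queryset's primary keys; queryset_iterator_sim and the list-comprehension "filter" are illustrative stand-ins, not Django APIs:

```python
import gc

def queryset_iterator_sim(pks, chunksize=1000):
    """Same loop shape as queryset_iterator, but over a plain list of
    integer primary keys instead of a Django queryset (toy stand-in)."""
    pk = 0
    last_pk = max(pks)
    while pk < last_pk:
        # stands in for queryset.filter(pk__gt=pk)[:chunksize]
        chunk = [p for p in sorted(pks) if p > pk][:chunksize]
        for row_pk in chunk:
            pk = row_pk
            yield row_pk
        gc.collect()

# Gaps in the pk sequence (e.g. deleted rows) are handled naturally,
# because each chunk starts strictly after the last pk seen:
rows = [1, 2, 5, 8, 13, 21]
result = list(queryset_iterator_sim(rows, chunksize=2))
```

Each chunk picks up strictly after the last primary key seen, which is why the function only works when it can impose its own pk ordering.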

More like this

  1. Template tag - list punctuation for a list of items by shapiromatron 2 months, 2 weeks ago
  2. JSONRequestMiddleware adds a .json() method to your HttpRequests by cdcarter 2 months, 3 weeks ago
  3. Serializer factory with Django Rest Framework by julio 9 months, 2 weeks ago
  4. Image compression before saving the new model / work with JPG, PNG by Schleidens 10 months, 1 week ago
  5. Help text hyperlinks by sa2812 11 months ago

Comments

guettli (on April 15, 2010):

Django does not load all rows into memory at once, but it caches the results while you iterate over them. At the end you have everything in memory (if you don't use .iterator()). For most cases this is not a problem.

I had memory problems when looping over huge querysets. I solved them with this:

Check that connection.queries is empty. With settings.DEBUG == True, Django stores all executed queries there. (Or replace the list with a dummy object which does not store anything):

from django.db import connection
assert not connection.queries, 'settings.DEBUG=True?'

Use queryset.iterator() to disable the internal cache.

Use values_list() if you know you need only some values.
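The "dummy object" trick mentioned above can be sketched in plain Python; DummyQueries is a hypothetical name for illustration, not a Django API:

```python
class DummyQueries(list):
    """Stands in for connection.queries so a long-running job under
    DEBUG-style logging accumulates nothing."""
    def append(self, item):
        pass  # discard instead of storing the query dict

log = DummyQueries()
for i in range(10_000):
    # shape of the entries Django would normally append
    log.append({'sql': 'SELECT ...', 'time': '0.001'})
```

Because append() is a no-op, the list stays empty no matter how many queries run, so memory stays flat.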


barthed (on September 8, 2010):

Thanks. It saved my day!

The only issue I encountered is that the function throws an exception when the queryset is empty (no result).


tomgruner (on November 17, 2010):

Thanks, this worked great for a table I have with 250,000 rows in it!


mrcoles (on December 19, 2016):

In terms of handling an empty queryset the following should work:

pk = 0  
last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()  
if last_pk is not None:  
    queryset = queryset.order_by('pk')  
    while pk < last_pk:  
        for row in queryset.filter(pk__gt=pk)[:chunksize]:  
            pk = row.pk  
            yield row  
        gc.collect()


mingdongt (on December 22, 2016):

What would happen if the entry with last_pk gets deleted during this process? Then I guess pk would never reach last_pk, and the loop would be infinite?
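The scenario described here can be guarded against by stopping as soon as a chunk comes back empty. A plain-Python sketch, where fetch_after stands in for queryset.filter(pk__gt=pk).order_by('pk')[:n] and queryset_iterator_safe is a hypothetical variant, not part of the snippet:

```python
import gc

def queryset_iterator_safe(fetch_after, last_pk, chunksize=1000):
    """Chunked loop with an empty-chunk guard: if the rows up to
    last_pk are deleted mid-run, the loop stops instead of spinning."""
    pk = 0
    while pk < last_pk:
        chunk = fetch_after(pk, chunksize)
        if not chunk:
            break  # remaining rows (including last_pk) were deleted
        for row_pk in chunk:
            pk = row_pk
            yield row_pk
        gc.collect()

# Simulated table whose last row (pk=9) is deleted by another process
# right after the first chunk is fetched.
rows = [1, 2, 3, 4, 5, 9]

def fetch_after(pk, n):
    result = [r for r in rows if r > pk][:n]
    if 9 in rows:
        rows.remove(9)  # simulate concurrent deletion of the last row
    return result

seen = list(queryset_iterator_safe(fetch_after, last_pk=9, chunksize=3))
```

Without the empty-chunk break, the while condition alone would indeed loop forever once pk can no longer advance.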

