Identifying candidates for 'only' and 'defer'

Hi All,

We have a Django application that almost everywhere pulls every single attribute of every single table joined in queries issued from the ORM. I’m looking at a query today that retrieves 225 attributes from 12 tables. It runs 5 times longer than when I return a reasonable subset of the attributes, say, half a dozen. The vast, vast majority of these attributes are simply thrown away, and that represents wasted IO, memory and compute. We need to start cutting these result sets down using ‘only’ and ‘defer’, but this is an application with over 450 tables and a great deal of code, so identifying where to do this with greatest effect is not going to be easy.

My question: does anyone have any suggestions for instrumenting the code so that we could easily see when attributes are pulled from the database and not used? I’m imagining something like a patch on the QuerySet class that monitors attributes loaded, and attributes read by the calling code. I haven’t gotten farther down the road than wishful thinking. Thought I’d ask for some advice before throwing a bunch of time at it.

Thanks!
David

<conjecture>
Hmmm … you might be able to add something to the Model metaclass to track / log references to individual fields. It’s not perfect and will have a number of significant “holes”* - but it may be a place to start.

  • Right off-hand I can see where using the values clause is going to negate this. What’s returned from it is effectively a list of dicts, and so the model class has no part of the elements being retrieved.

Depending upon the complexity and dynamics of your code, you may get some more-or-less reasonable numbers. I guess it may come down to just how accurate you wish to be.
</conjecture>

Hello @djantzen shameless plug for a package I maintain that I suggest you have a look at.

django-seal allows you to opt-in some of your models into emitting warnings when fields are unexpectedly lazily fetched instead of silently fetching them.

A strategy to tame your queries here could be to seal one of your model and make use of a very conservative .only call on a queryset passed down your application. I’d suggest starting with a set of fields you know for sure will be used one way or the other and elevating warnings to errors when running your test suite and / or piping warnings you encounter to your logging system in production to give you a sense of our you need to augment your only predicate.

From there you can keep using seal with elevated warnings to errors when your running your test suite and you should be able to catch any regression and fetching just another field lazily.

1 Like