I’ve been focusing on performance improvements to our ListView pages over the past month or so, and I’ve made great strides while doing some rapid prototyping.
I wasn’t the original developer for these pages, and they did not implement server-side pagination. Our samples page could take nearly 3 minutes to load. Now a page load with a reasonable number of results seems nearly instantaneous, and a 1000-row page takes 15s, which I’m satisfied with, though I might try to tweak it.
I’m of course using `prefetch_related` for all of the related tables, and I’m thinking about adding usage of `only` and `defer`.
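For context, the general shape of what I mean is below (a minimal sketch only; the import path and the specific field names like "name" and "description" are placeholders, not our real schema):

```python
from django.db.models import Prefetch

from myapp.models import Sample, Study  # import path illustrative

# Prefetch the related rows, but restrict the columns pulled for them.
queryset = (
    Sample.objects
    .select_related("animal")  # forward FK: fetched in the same query
    .prefetch_related(
        Prefetch(
            "animal__studies",  # M2M reached through the FK
            queryset=Study.objects.only("id", "name"),  # placeholder fields
        )
    )
    .defer("description")  # placeholder for wide columns the page never renders
)
```

The goal is just to keep the related-table lookups batched while trimming what each prefetch pulls back.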
However, there’s one page that I have tried to focus on and can’t seem to get that last bit of performance out of, and that’s our `ArchiveFileListView` page. It can take 3-6 seconds per 10-row page and 15-20s for a 1000-row page. So the 1000-row page is on par with the samples page, but it irks me that the 10-row page takes, on average, 4s to load. The reason for its slowness is the study column. If I eliminate that column, it’s nearly instantaneous. The `ArchiveFile` model is linked to from a few places, and those places are far from the `Study` model. The field paths are:
`peak_groups__msrun_sample__sample__animal__studies`
`mz_to_msrunsamples__sample__animal__studies`
`raw_to_msrunsamples__sample__animal__studies`
There are 2 many-related steps in each of these paths (only 1 is ever populated, BTW, so I use a `Case`/`When` strategy based on file type [after playing around with `Coalesce`]; a rough sketch of what I mean follows below):
`ArchiveFile.peak_groups` (`ArchiveFile`:`PeakGroup`), `ArchiveFile.raw_to_msrunsamples` (`ArchiveFile`:`MSRunSample`), and `ArchiveFile.mz_to_msrunsamples` (`ArchiveFile`:`MSRunSample`) are all reverse relations, essentially one-to-many from the perspective of `ArchiveFile`. `Animal.studies` is a many-to-many relationship.
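The `Case`/`When` I mentioned looks roughly like this (a sketch only; the discriminator field `data_type`, its values, and the `name` field on `Study` are placeholders, and the to-many joins obviously fan the rows out):

```python
from django.db.models import Case, CharField, F, When

from myapp.models import ArchiveFile  # import path illustrative

# Pick the one populated path per file type and pull the study name through it.
study_name = Case(
    When(data_type="mzxml", then=F("mz_to_msrunsamples__sample__animal__studies__name")),
    When(data_type="raw", then=F("raw_to_msrunsamples__sample__animal__studies__name")),
    default=F("peak_groups__msrun_sample__sample__animal__studies__name"),
    output_field=CharField(),
)
queryset = ArchiveFile.objects.annotate(study_name=study_name)
```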
I had initially explored using `Prefetch`’s `to_attr` argument to hold the unique `Study` objects, but since there’s another many-related model on the path before it, I couldn’t get that to work, so I overrode `paginate_queryset` to set the attribute on the `ArchiveFile` object by iterating through the page’s worth of results. That trick got me from 1m down to the 15-20s timing for 1000 rows.
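A simplified version of that override is below (the attribute name and import path are made up, and it naively walks all three paths rather than only the `Case`/`When`-selected one, so treat it as a sketch):

```python
from django.views.generic import ListView

from myapp.models import ArchiveFile  # import path illustrative


class ArchiveFileListView(ListView):
    model = ArchiveFile
    paginate_by = 10

    def paginate_queryset(self, queryset, page_size):
        # Let ListView paginate as usual first...
        paginator, page, object_list, is_paginated = super().paginate_queryset(
            queryset, page_size
        )
        # ...then resolve the unique studies for only this page's worth of rows.
        for af in object_list:
            studies = set()
            for pg in af.peak_groups.all():
                studies.update(pg.msrun_sample.sample.animal.studies.all())
            for msrs in af.mz_to_msrunsamples.all():
                studies.update(msrs.sample.animal.studies.all())
            for msrs in af.raw_to_msrunsamples.all():
                studies.update(msrs.sample.animal.studies.all())
            af.studies_list = sorted(studies, key=lambda s: s.pk)  # attr name made up
        return paginator, page, object_list, is_paginated
```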
There are lots of `PeakGroup` and `MSRunSample` records that all link to the same `ArchiveFile` records, but they all link to the same `Animal`, and there are usually only 1 or 2 studies that an animal belongs to.
Each row on the `ArchiveFile` page should be a unique `ArchiveFile` record, and the `Study` column should be a delimited list of unique (linked) study records.
Should I just maintain a “studies” many-to-many link from `ArchiveFile` to `Study`? I try to avoid redundant links, but I’m getting the feeling that either that or caching will be necessary, just due to the data relationships.
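If I did go that route, it would just be something like this (sketch; the `related_name` is made up, and the link would have to be populated and kept in sync wherever `PeakGroup`/`MSRunSample` records are loaded):

```python
from django.db import models


class ArchiveFile(models.Model):
    # ... existing fields ...

    # Denormalized shortcut straight to Study, maintained at load time.
    studies = models.ManyToManyField(
        "Study", related_name="archive_files", blank=True
    )
```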
Extra info:
I have not yet investigated how many queries are being executed. I plan to install the Django Debug Toolbar for the first time this week and see if I can discern where the bottlenecks are.
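The setup I’m planning is just the standard bits from the django-debug-toolbar docs (untested on my end yet):

```python
# settings.py (development only)
INSTALLED_APPS += ["debug_toolbar"]
MIDDLEWARE = ["debug_toolbar.middleware.DebugToolbarMiddleware"] + MIDDLEWARE
INTERNAL_IPS = ["127.0.0.1"]

# urls.py
from django.urls import include, path

urlpatterns += [path("__debug__/", include("debug_toolbar.urls"))]
```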
I also have a TSV iterator for exporting the table, and it runs satisfyingly fast on the entire model. In fact, that is the strategy I used when refactoring the view/template. It wasn’t storing objects; it was just getting fields (no keys).
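That export is basically doing this kind of thing (sketch; the field names are placeholders):

```python
from myapp.models import ArchiveFile  # import path illustrative

# values_list() never builds model objects; it just streams tuples of the
# columns being written, which is why the full-table export stays fast.
def archive_file_tsv_rows():
    rows = ArchiveFile.objects.values_list(
        "filename",
        "peak_groups__msrun_sample__sample__animal__studies__name",
    ).iterator()
    for row in rows:
        yield "\t".join("" if value is None else str(value) for value in row) + "\n"
```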