Query performance question

Hi all,

I have two models:

from django.contrib.gis.db import models as models_gis
from django.db import models
from django.utils.translation import gettext_lazy as _


class Postcode(models.Model):

    code = models.CharField(_('code'), max_length=20, blank=False, null=False, primary_key=True)
    lat = models.FloatField(_('Latitude'), null=True, blank=True)
    lon = models.FloatField(_('Longitude'), null=True, blank=True)
    coordinates = models_gis.PointField(_('Coordinates'), null=True, blank=True, srid=4326, spatial_index=True)
    
    def __str__(self):
        return self.code

    class Meta:
        ordering = ['code']
        indexes = [
            models.Index(
                name='country_postcode_lat_idx',
                fields=['lat']
            ),
            models.Index(
                name='country_postcode_lon_idx',
                fields=['lon']
            ),
        ]
        
        
class Store(models.Model):
    # <snip>
    postcode = models.ForeignKey(Postcode, null=True, blank=True, default=None, on_delete=models.SET_NULL)
    name = models.CharField(_('Name'), max_length=100, blank=False, null=False, default='')
    
    def __str__(self):
        return self.name

The postcode table has about 2.3m rows, and the store one about 1000. Nothing too large…

I run a query to find out which postcodes are used by stores, how many stores use each, and to sort them in descending order. It sounds very simple, and this is how I do it – maybe wrongly, because the performance is quite bad, and I also find the generated SQL query itself problematic.

from django.db.models import Count, IntegerField, OuterRef, Subquery
from django.db.models.functions import Coalesce
from country.models import Postcode
from store.models import Store
ps = Postcode.objects.all().annotate(
    store_used_count=Coalesce(Subquery(
        Store.objects.filter(
                postcode__pk=OuterRef('pk'),
        ).values(
                'postcode__pk'
        ).order_by(
                'postcode__pk'
        ).annotate(
                count=Count('pk', distinct=True)
        ).values(
                'count'
        ),
        output_field=IntegerField()
    ), 0)
).filter(
    store_used_count__gt=0
).order_by(
    '-store_used_count'
).values(
    'pk',
    'store_used_count'
)
for p in ps:
    print(f'{p["pk"]} -> {p["store_used_count"]}')

Now, this runs on my laptop in about 12 seconds, and not much quicker on a 64-core SQL machine… so I am guessing I am indeed doing it wrong, and advice/ideas would be very much appreciated :slight_smile:

Here’s the explain from django:

>>> print(ps.explain())
-> Sort: store_used_count DESC
    -> Stream results  (cost=238534.87 rows=2282913)
        -> Filter: (coalesce((select #3),0) > 0)  (cost=238534.87 rows=2282913)
            -> Index scan on country_postcode using country_postcode_lat_idx  (cost=238534.87 rows=2282913)
            -> Select #3 (subquery in condition; dependent)
                -> Group aggregate: count(distinct U0.id)  (cost=0.58 rows=2)
                    -> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`)  (cost=0.42 rows=2)
-> Select #2 (subquery in projection; dependent)
    -> Group aggregate: count(distinct U0.id)  (cost=0.58 rows=2)
        -> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`)  (cost=0.42 rows=2)

About the query generated by django, which is:

SELECT
  `country_postcode`.`code`,
  COALESCE(
    (
      SELECT
        COUNT(DISTINCT U0.`id`) AS `count`
      FROM
        `store_store` U0
      WHERE
        U0.`postcode_id` = `country_postcode`.`code`
      GROUP BY
        U0.`postcode_id`
      ORDER BY
        NULL
    ),
    0
  ) AS `store_used_count`
FROM
  `country_postcode`
WHERE
  COALESCE(
    (
      SELECT
        COUNT(DISTINCT U0.`id`) AS `count`
      FROM
        `store_store` U0
      WHERE
        U0.`postcode_id` = `country_postcode`.`code`
      GROUP BY
        U0.`postcode_id`
      ORDER BY
        NULL
    ),
    0
  ) > 0
ORDER BY
  `store_used_count` DESC
;

I struggle to understand why the store_used_count expression is declared twice instead of being reused, as in this manually modified query (which, by the way, does not run any faster):

SELECT
  `country_postcode`.`code`,
  COALESCE(
    (
      SELECT
        COUNT(DISTINCT U0.`id`) AS `count`
      FROM
        `store_store` U0
      WHERE
        U0.`postcode_id` = `country_postcode`.`code`
      GROUP BY
        U0.`postcode_id`
      ORDER BY
        NULL
    ),
    0
  ) AS `store_used_count`
FROM
  `country_postcode`
HAVING
  store_used_count > 0
ORDER BY
  `store_used_count` DESC
;

Thanks!

What’s wrong with:
Postcode.objects.annotate(count=Count('store')).order_by('-count')?

Nothing is wrong with that query… it yields the same result set, and is a tiny bit faster. Thanks!

Which makes me suspect it might be an indexing thing (somewhat beyond Django itself, but if anyone has an insight, I would appreciate it :slight_smile: )

It’s not an indexing issue. Any query returning millions of rows is going to take a long time to execute.

The actual query itself is probably a relatively small portion of the overall time. Copying a million rows from the database to Python is going to take a while, and Python creating a million instances of the Postcode object will also take a chunk of time.
You might be able to factor out that last step by adding a values clause to the query - perhaps values('code', 'count').
You can also get an idea of how much time is spent returning values by adding a filter to the expression to limit the number of rows returned - perhaps filter(count__gt=0), to get just the postcodes serviced by stores, reducing the number of rows returned to no more than the 1000 stores you have.

So I’d suggest trying:
Postcode.objects.annotate(count=Count('store')).order_by('-count').filter(count__gt=0).values('code', 'count')
and see how quickly that runs.
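
To separate the query time from the transfer/instantiation time, a minimal timing sketch (assuming ps is whatever queryset is under test) could be:

import time

start = time.perf_counter()
rows = list(ps)  # forces the query to run and every row to be transferred
print(f'{len(rows)} rows in {time.perf_counter() - start:.2f}s')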

Hi and thanks again,

If you look closely at the original queryset expression at the very top:

ps = Postcode.objects.all().annotate(
    store_used_count=Coalesce(Subquery(
        Store.objects.filter(
                postcode__pk=OuterRef('pk'),
        ).values(
                'postcode__pk'
        ).order_by(
                'postcode__pk'
        ).annotate(
                count=Count('pk', distinct=True)
        ).values(
                'count'
        ),
        output_field=IntegerField()
    ), 0)
).filter(
    store_used_count__gt=0
).order_by(
    '-store_used_count'
).values(
    'pk',
    'store_used_count'
)

You will notice that:

  • the filter is there
  • full objects are not instantiated; only values are requested

The 12 seconds is measured against this queryset. About 600 rows are returned.

Cheers, :slight_smile:

Ok, but the filter wasn’t there in my suggestion - that’s what I’m recommending you add, to see how it affects the performance.

After loading the table(s), did you do a full analyze? I find it interesting that your explain showed that you were using the country_postcode_lat_idx index instead of the primary key index.
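
For reference, on MySQL that full analyze would be something along the lines of:

ANALYZE TABLE country_postcode;

which refreshes the index statistics the optimizer relies on for its cost estimates.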

(I’d also be interested in seeing the explain for the completely revised query.)

Ah right. I did add it anyway when testing (I didn’t want to wait 10 minutes for the console to finish outputting the result set :p)

it was:

ps = Postcode.objects.annotate(count=Count('store')).filter(count__gt=0).values('pk', 'count').order_by('-count')

resulting in:

SELECT
  `country_postcode`.`code`,
  COUNT(`store_store`.`id`) AS `count`
FROM
  `country_postcode`
  LEFT OUTER JOIN `store_store` ON (
    `country_postcode`.`code` = `store_store`.`postcode_id`
  )
GROUP BY
  `country_postcode`.`code`
HAVING
  COUNT(`store_store`.`id`) > 0
ORDER BY
  `count` DESC

and executing in the same environment in ~11.9 secs.

The explain seems ok, with only index and ref access types, so… could it simply be linked to the size of the postcode table?

Yes, that’s my initial reaction.

It might be interesting to see the explain on this query. Also, depending upon how you built / loaded the postcode table, it might be worth running a reindex / analyze on that table.

sure thing, here’s the explain:

-> Sort: count DESC
    -> Filter: (count(store_store.id) > 0)
        -> Stream results  (cost=1561023.56 rows=3749634)
            -> Group aggregate: count(store_store.id), count(store_store.id)  (cost=1561023.56 rows=3749634)
                -> Nested loop left join  (cost=1186060.15 rows=3749634)
                    -> Index scan on country_postcode using PRIMARY  (cost=236508.70 rows=2282913)
                    -> Index lookup on store_store using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`)  (cost=0.25 rows=2)

Yep, it’s due to the size of that table.

The query might be optimizable in a couple of different ways. One would be to use a technique I think of as “turning the query inside-out” - in this case, filtering the Postcode table down to only those rows referenced by the Store table before doing the counts. (I’d have to think a little about exactly how to express this in the ORM - I don’t have an immediate off-the-top-of-my-head answer.)
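
A rough, untested sketch of that inside-out idea - restricting Postcode to the primary keys Store actually references before counting - might look something like this:

from django.db.models import Count
from country.models import Postcode
from store.models import Store

# The database resolves this as a subquery; nothing is pulled into Python here.
used_pks = Store.objects.values_list('postcode_id', flat=True).distinct()
ps = Postcode.objects.filter(
    pk__in=used_pks
).annotate(
    store_used_count=Count('store')
).order_by(
    '-store_used_count'
).values('pk', 'store_used_count')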

The other way would be to look to optimize the database server a little more regarding things like amount of memory allocated to the different caches. In the ideal situation, you’d want enough memory allocated for close to 2x the size of the tables + indexes. (Also, there is some tuning that can be done depending upon whether your storage is physical disk or solid-state.)
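
To get a feel for the numbers involved, table and index sizes can be read from information_schema on MySQL, and the InnoDB buffer pool sized accordingly in my.cnf (the 4G below is just an illustrative value, not a recommendation):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = DATABASE();

-- then, in my.cnf: innodb_buffer_pool_size = 4G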

ok so two things happened here.

on the database itself:

ALTER TABLE `db_name`.country_postcode ENGINE = InnoDB;

as per the MySQL 8.0 Reference Manual, §2.11.13 “Rebuilding or Repairing Tables or Indexes”

then… the “turning inside-out” - I gave it a try using:

from django.db.models import Count, IntegerField, OuterRef, Subquery, Exists
from django.db.models.functions import Coalesce
from country.models import Postcode
from store.models import Store
ps = Postcode.objects.annotate(
    is_used=Exists(
        Store.objects.filter(
            postcode__pk=OuterRef('pk'),
        )
    ),
    store_used_count=Coalesce(Subquery(
        Store.objects.filter(
            postcode__pk=OuterRef('pk'),
        ).values(
            'postcode__pk'
        ).order_by(
            'postcode__pk'
        ).annotate(
            count=Count('pk', distinct=True)
        ).values(
            'count'
        ),
        output_field=IntegerField()
    ), 0)
).filter(
    is_used=True,
    store_used_count__gt=0
).order_by(
    '-store_used_count'
).values(
    'pk',
    'store_used_count'
)

which in raw SQL was translated into:

SELECT
  `country_postcode`.`code`,
  COALESCE(
    (
      SELECT
        COUNT(DISTINCT U0.`id`) AS `count`
      FROM
        `store_store` U0
      WHERE
        U0.`postcode_id` = `country_postcode`.`code`
      GROUP BY
        U0.`postcode_id`
      ORDER BY
        NULL
    ),
    0
  ) AS `store_used_count`
FROM
  `country_postcode`
WHERE
  (
    EXISTS(
      SELECT
        (1) AS `a`
      FROM
        `store_store` U0
      WHERE
        U0.`postcode_id` = `country_postcode`.`code`
      LIMIT
        1
    )
    AND COALESCE(
      (
        SELECT
          COUNT(DISTINCT U0.`id`) AS `count`
        FROM
          `store_store` U0
        WHERE
          U0.`postcode_id` = `country_postcode`.`code`
        GROUP BY
          U0.`postcode_id`
        ORDER BY
          NULL
      ),
      0
    ) > 0
  )
ORDER BY
  `store_used_count` DESC
;

execution time …

0.015 seconds :smiley:

I think we’ve got it :smiley:


My only gripe would be… I should have run the new query against the non-rebuilt table first, to be sure… :frowning:

Ok, I ran the non-optimised query on the “rebuilt” table. Still 12 seconds, so the Exists() filter is clearly the winner here.

based on the following explain:

-> Sort: store_used_count DESC
    -> Stream results  (cost=1079.99 rows=928)
        -> Nested loop inner join  (cost=1079.99 rows=928)
            -> Remove duplicates from input sorted on store_store_postcode_id_955822e2_fk_country_postcode_code  (cost=59.19 rows=928)
                -> Index scan on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code  (cost=59.19 rows=928)
            -> Filter: (coalesce((select #4),0) > 0)  (cost=928.10 rows=1)
                -> Single-row index lookup on country_postcode using PRIMARY (code=U0.postcode_id)  (cost=928.10 rows=1)
                -> Select #4 (subquery in condition; dependent)
                    -> Group aggregate: count(distinct U0.id)  (cost=0.58 rows=2)
                        -> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`)  (cost=0.42 rows=2)
-> Select #2 (subquery in projection; dependent)
    -> Group aggregate: count(distinct U0.id)  (cost=0.58 rows=2)
        -> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`)  (cost=0.42 rows=2)

It seems clear that the Exists() considerably limits the scope of the subsequent scan, cutting it down to only 900-ish rows and using an eq_ref lookup.

Could it be made better?

@KenWhitesell: Subquery is used because other filters potentially have to be applied to the Store objects.

and no, the database is not optimised at all in terms of buffer pool size, threads, etc. It just runs on a laptop->virtualbox->centos 7 guest, using an almost off-the-shelf my.cnf.
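
On the “could it be made better?” question - one more variation that might be worth timing, as an untested sketch: aggregating from the Store side, so the 2.3m-row postcode table is never scanned at all. This assumes only the postcode PK and the count are needed; any extra Store filters can be chained on before the values() call:

from django.db.models import Count
from store.models import Store

counts = Store.objects.values(
    'postcode_id'  # group by the FK column, i.e. the postcode PK
).annotate(
    store_used_count=Count('id')
).order_by('-store_used_count')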

As a heads up, that is_used annotation should be replaceable with an __isnull=False check:

ps = Postcode.objects.filter(
    store__isnull=False
).annotate(
    # I dropped the coalesce since it should be returning the count.
    store_used_count=Subquery(
        Store.objects.filter(
            postcode__pk=OuterRef('pk'),
        ).values(
            'postcode__pk'
        ).order_by(
            'postcode__pk'
        ).annotate(
            count=Count('pk', distinct=True)
        ).values(
            'count'
        ),
        output_field=IntegerField()
    ),
).filter(
    store_used_count__gt=0
).order_by(
    '-store_used_count'
).values(
    'pk',
    'store_used_count'
)

I would also suggest applying @KenWhitesell’s proposal of using Count without the Subquery here as well, to reduce the amount of code.

ps = Postcode.objects.filter(
    store__isnull=False
).annotate(
    store_used_count=Count('store')
).filter(
    store_used_count__gt=0
).order_by(
    '-store_used_count'
).values(
    'pk',
    'store_used_count'
)

Thank you guys, this now runs much better. I have to keep the Subqueries though, for the reason mentioned above :slight_smile:

If you’re more specific about those additional filters, we may be able to suggest alternatives. There’s a lot of flexibility there you may not be familiar with.

Simple things coming in from a form’s cleaned_data. I have tested them, and they do not impact performance in a noticeable way (for now :slight_smile: )
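
Purely as a hypothetical sketch of how such cleaned_data filters could slot into the inner queryset before it is wrapped in Subquery (the form's 'name' field here is an assumption, not from this thread):

from django.db.models import OuterRef
from store.models import Store

def stores_for_postcode(cleaned_data):
    # Inner queryset for the Subquery annotation, narrowed by form input.
    qs = Store.objects.filter(postcode__pk=OuterRef('pk'))
    name = cleaned_data.get('name')  # hypothetical form field
    if name:
        qs = qs.filter(name__icontains=name)
    return qs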