Hi all,
I have two models:
class Postcode(models.Model):
code = models.CharField(_('code'), max_length=20, blank=False, null=False, primary_key=True)
lat = models.FloatField(_('Latitude'), null=True, blank=True)
lon = models.FloatField(_('Longitude'), null=True, blank=True)
coordinates = models_gis.PointField(_('Coordinates'), null=True, blank=True, srid=4326, True, spatial_index=True)
def __str__(self):
return self.code
class Meta:
ordering = ['code']
indexes = [
models.Index(
name='country_postcode_lat_idx',
fields=['lat']
),
models.Index(
name='country_postcode_lon_idx',
fields=['lon']
),
]
class Store(models.Model):
# <snip>
postcode = models.ForeignKey(Postcode, null=True, blank=True, default=None, on_delete=models.SET_NULL)
name = models.CharField(_('Name'), max_length=100, blank=False, null=False, default='')
def __str__(self):
return self.code
The postcode table has about 2.3m rows, and the store one about 1000. Nothing too large…
I run a query to find out which postcode are used by stores, how many and sort them in DESC order. Sounds very simple and this how I do it – maybe wrongly because the performance is quite bad, and the generated SQL query in itself I find problematic.
from django.db.models import Count, IntegerField, OuterRef, Subquery
from django.db.models.functions import Coalesce
from country.models import Postcode
from store.models import Store
ps = Postcode.objects.all().annotate(
store_used_count=Coalesce(Subquery(
Store.objects.filter(
postcode__pk=OuterRef('pk'),
).values(
'postcode__pk'
).order_by(
'postcode__pk'
).annotate(
count=Count('pk', distinct=True)
).values(
'count'
),
output_field=IntegerField()
), 0)
).filter(
store_used_count__gt=0
).order_by(
'-store_used_count'
).values(
'pk',
'store_used_count'
)
for p in ps:print(f'{p["pk"]} -> {p["store_used_count"]}')
Now, this runs on laptop in about 12 seconds. Not much quicker a 64 core sql machine… so I am guessing I am indeed doing it wrong and advice/ideas would be very appreciated
Here’s the explain from django:
>>> print(ps.explain())
-> Sort: store_used_count DESC
-> Stream results (cost=238534.87 rows=2282913)
-> Filter: (coalesce((select #3),0) > 0) (cost=238534.87 rows=2282913)
-> Index scan on country_postcode using country_postcode_lat_idx (cost=238534.87 rows=2282913)
-> Select #3 (subquery in condition; dependent)
-> Group aggregate: count(distinct U0.id) (cost=0.58 rows=2)
-> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`) (cost=0.42 rows=2)
-> Select #2 (subquery in projection; dependent)
-> Group aggregate: count(distinct U0.id) (cost=0.58 rows=2)
-> Index lookup on U0 using store_store_postcode_id_955822e2_fk_country_postcode_code (postcode_id=country_postcode.`code`) (cost=0.42 rows=2)
About the query generated by django, which is:
SELECT
`country_postcode`.`code`,
COALESCE(
(
SELECT
COUNT(DISTINCT U0.`id`) AS `count`
FROM
`store_store` U0
WHERE
U0.`postcode_id` = `country_postcode`.`code`
GROUP BY
U0.`postcode_id`
ORDER BY
NULL
),
0
) AS `store_used_count`
FROM
`country_postcode`
WHERE
COALESCE(
(
SELECT
COUNT(DISTINCT U0.`id`) AS `count`
FROM
`store_store` U0
WHERE
U0.`postcode_id` = `country_postcode`.`code`
GROUP BY
U0.`postcode_id`
ORDER BY
NULL
),
0
) > 0
ORDER BY
`store_used_count` DESC
;
I struggle to understand why the variable store_used_count
is re-declared, instead of making actual use of such as in this manually modified query (which does not run faster, by the way):
SELECT
`country_postcode`.`code`,
COALESCE(
(
SELECT
COUNT(DISTINCT U0.`id`) AS `count`
FROM
`store_store` U0
WHERE
U0.`postcode_id` = `country_postcode`.`code`
GROUP BY
U0.`postcode_id`
ORDER BY
NULL
),
0
) AS `store_used_count`
FROM
`country_postcode`
HAVING
store_used_count > 0
ORDER BY
`store_used_count` DESC
;
Thanks!