Annotations outcome in generated SQL - duplicates/repetition

plopidou · March 30, 2022, 1:22pm

Hi,

I want to try and understand why the generated SQL for annotations always (?) results in the subqueries being duplicated when re-used in other annotations, instead of using the relevant AS variables. For example:

class Person(models.Model):
    created_datetime = models.DateTimeField('Created at', null=False, blank=False, default=timezone.now)
    name = models.CharField('Name', max_length=128, null=True, blank=True)
    parent = models.ForeignKey('self', null=True, blank=True, on_delete=models.CASCADE)
    
    def __str__(self):
        return self.name

class PersonLog(models.Model):
    person = models.ForeignKey('person.Person', null=True, blank=True, on_delete=models.CASCADE)
    created_datetime = models.DateTimeField('Created at', null=False, blank=False, default=timezone.now)
    login = models.BooleanField('Login', null=True, blank=True)
    
    def __str__(self):
        return self.person.name

In an example used elsewhere, I build a queryset like so:

persons = Person.objects.all().annotate(
    log_login_count=Count(
        'personlog',
        filter=Q(personlog__login=True)
    ),
    log_login_duration_over=Now() - F('created_datetime'),
    log_login_frequency=ExpressionWrapper(
        F('log_login_duration_over') / F('log_login_count'),
        output_field=models.DurationField(),
    ),
).values()

Testing this on sqlite (but the same happens when running the same query on MySQL – no idea about PG), the generated SQL is:

SELECT
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  COUNT("person_personlog"."id") FILTER (
    WHERE
      "person_personlog"."login"
  ) AS "log_login_count",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  ) AS "log_login_duration_over",
  (
    django_format_dtdelta(
      '/',
      django_timestamp_diff(
        CURRENT_TIMESTAMP,
        "person_person"."created_datetime"
      ),
      COUNT("person_personlog"."id") FILTER (
        WHERE
          "person_personlog"."login"
      )
    )
  ) AS "log_login_frequency"
FROM
  "person_person"
  LEFT OUTER JOIN "person_personlog" ON (
    "person_person"."id" = "person_personlog"."person_id"
  )
GROUP BY
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  )

Notice this:

  (
    django_format_dtdelta(
      '/',
      django_timestamp_diff(
        CURRENT_TIMESTAMP,
        "person_person"."created_datetime"
      ),
      COUNT("person_personlog"."id") FILTER (
        WHERE
          "person_personlog"."login"
      )
    )
  ) AS "log_login_frequency"

Given the two operands of the division are computed/declared just above, why is the complete query not generated as:

SELECT
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  COUNT("person_personlog"."id") FILTER (
    WHERE
      "person_personlog"."login"
  ) AS "log_login_count",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  ) AS "log_login_duration_over",
  (
    django_format_dtdelta(
      '/',
      log_login_duration_over,
      log_login_count
    )
  ) AS "log_login_frequency"
FROM
  "person_person"
  LEFT OUTER JOIN "person_personlog" ON (
    "person_person"."id" = "person_personlog"."person_id"
  )
GROUP BY
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  )

where both log_login_duration_over and log_login_count are actually used directly after being “declared”.

The advantages seem obvious:

less bytes sent over the wire
more performance? Would the redeclaration/recomputation be heavily cached at database level anyway?

I am sure there is a good reason for this, but would like to understand it!

Many thanks!

felixxm · March 31, 2022, 4:12am

Using AS in SQL queries doesn’t create virtual columns or sth like variables. Declared aliases are not accessible in SELECT or WHERE clauses and cannot be reused. The proposed query will crash with:

Error: no such column: log_login_count
Error: no such column: log_login_duration_over

plopidou · March 31, 2022, 8:30am

Yes that is a good reason, thanks

Just tried this on MySQL 8.xx though, and it works (it’s a quick test mind you, so probably a can of worms with baddies I am not aware of…)

SELECT
  DISTINCT 
  `person`.`id`,
  `person`.`name`,
  @login_count := COUNT(
    CASE WHEN `person_personlog`.`login` THEN `person_personlog`.`id` ELSE NULL END
  ) AS `login_count`,
  @login_duration_over := TIMESTAMPDIFF(
    MICROSECOND,
    `person_person`.`created_datetime`,
    CURRENT_TIMESTAMP
  ) AS `login_duration_over`,
  @login_frequency := (@login_duration_over / @login_count) AS `login_frequency`,

<snip>

Where login_duration_over , login_count are declared once and reused to compute login_frequency.

Thoughts?

felixxm · March 31, 2022, 9:08am

This won’t work on all database and as far as I’m aware a similar syntax doesn’t exist on Oracle or SQLite. Moreover database optimizer should reuse it anyway.

plopidou · March 31, 2022, 10:04am

yep ok.
I suspected this would be optimised/cached at database level anyway, but glad to get a confirmation.

thanks

Topic		Replies	Views
Annotate duration/integer -> MySql error Using the ORM	16	1484	March 28, 2022
Annotate from related QuerySet Using the ORM	5	2396	November 30, 2023
annotation of another annotation duplicate query rule Using the ORM	4	828	December 9, 2022
How Query class work? Django Internals	3	581	May 10, 2022
annotate with subquery in Django Using Django	13	6269	June 26, 2020

Annotations outcome in generated SQL - duplicates/repetition

Related topics