Annotations outcome in generated SQL - duplicates/repetition

Hi,

I want to try and understand why the generated SQL for annotations always (?) results in the subqueries being duplicated when re-used in other annotations, instead of using the relevant AS variables. For example:

class Person(models.Model):
    created_datetime = models.DateTimeField('Created at', null=False, blank=False, default=timezone.now)
    name = models.CharField('Name', max_length=128, null=True, blank=True)
    parent = models.ForeignKey('self', null=True, blank=True, on_delete=models.CASCADE)
    
    def __str__(self):
        return self.name

class PersonLog(models.Model):
    person = models.ForeignKey('person.Person', null=True, blank=True, on_delete=models.CASCADE)
    created_datetime = models.DateTimeField('Created at', null=False, blank=False, default=timezone.now)
    login = models.BooleanField('Login', null=True, blank=True)
    
    def __str__(self):
        return self.person.name

In an example used elsewhere, I build a queryset like so:

persons = Person.objects.all().annotate(
    log_login_count=Count(
        'personlog',
        filter=Q(personlog__login=True)
    ),
    log_login_duration_over=Now() - F('created_datetime'),
    log_login_frequency=ExpressionWrapper(
        F('log_login_duration_over') / F('log_login_count'),
        output_field=models.DurationField(),
    ),
).values()

Testing this on sqlite (but the same happens when running the same query on MySQL – no idea about PG), the generated SQL is:

SELECT
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  COUNT("person_personlog"."id") FILTER (
    WHERE
      "person_personlog"."login"
  ) AS "log_login_count",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  ) AS "log_login_duration_over",
  (
    django_format_dtdelta(
      '/',
      django_timestamp_diff(
        CURRENT_TIMESTAMP,
        "person_person"."created_datetime"
      ),
      COUNT("person_personlog"."id") FILTER (
        WHERE
          "person_personlog"."login"
      )
    )
  ) AS "log_login_frequency"
FROM
  "person_person"
  LEFT OUTER JOIN "person_personlog" ON (
    "person_person"."id" = "person_personlog"."person_id"
  )
GROUP BY
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  )

Notice this:

  (
    django_format_dtdelta(
      '/',
      django_timestamp_diff(
        CURRENT_TIMESTAMP,
        "person_person"."created_datetime"
      ),
      COUNT("person_personlog"."id") FILTER (
        WHERE
          "person_personlog"."login"
      )
    )
  ) AS "log_login_frequency"

Given the two operands of the division are computed/declared just above, why is the complete query not generated as:

SELECT
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  COUNT("person_personlog"."id") FILTER (
    WHERE
      "person_personlog"."login"
  ) AS "log_login_count",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  ) AS "log_login_duration_over",
  (
    django_format_dtdelta(
      '/',
      log_login_duration_over,
      log_login_count
    )
  ) AS "log_login_frequency"
FROM
  "person_person"
  LEFT OUTER JOIN "person_personlog" ON (
    "person_person"."id" = "person_personlog"."person_id"
  )
GROUP BY
  "person_person"."id",
  "person_person"."created_datetime",
  "person_person"."name",
  "person_person"."parent_id",
  django_timestamp_diff(
    CURRENT_TIMESTAMP,
    "person_person"."created_datetime"
  )

where both log_login_duration_over and log_login_count are actually used directly after being “declared”.

The advantages seem obvious:

  • less bytes sent over the wire
  • more performance? Would the redeclaration/recomputation be heavily cached at database level anyway?

I am sure there is a good reason for this, but would like to understand it!

Many thanks!

Using AS in SQL queries doesn’t create virtual columns or sth like variables. Declared aliases are not accessible in SELECT or WHERE clauses and cannot be reused. The proposed query will crash with:

Error: no such column: log_login_count
Error: no such column: log_login_duration_over

Yes that is a good reason, thanks :slight_smile:

Just tried this on MySQL 8.xx though, and it works (it’s a quick test mind you, so probably a can of worms with baddies I am not aware of…)

SELECT
  DISTINCT 
  `person`.`id`,
  `person`.`name`,
  @login_count := COUNT(
    CASE WHEN `person_personlog`.`login` THEN `person_personlog`.`id` ELSE NULL END
  ) AS `login_count`,
  @login_duration_over := TIMESTAMPDIFF(
    MICROSECOND,
    `person_person`.`created_datetime`,
    CURRENT_TIMESTAMP
  ) AS `login_duration_over`,
  @login_frequency := (@login_duration_over / @login_count) AS `login_frequency`,

<snip>

Where login_duration_over , login_count are declared once and reused to compute login_frequency.

Thoughts?

This won’t work on all database and as far as I’m aware a similar syntax doesn’t exist on Oracle or SQLite. Moreover database optimizer should reuse it anyway.

yep ok.
I suspected this would be optimised/cached at database level anyway, but glad to get a confirmation.

thanks :slight_smile: