Annotate Total Created, Modified, and Deleted for each Year

Given a Model Item how would I find the total number of items created, modified, and deleted every year? And can this be done in a single database query?

from django.db import models

class Item(models.Model):
    created_at = models.DateTimeField(null=True, blank=True)
    modified_at = models.DateTimeField(null=True, blank=True)
    deleted_at = models.DateTimeField(null=True, blank=True)

My current query returns the same counts for total_created, total_modified, and total_deleted.

from django.db.models.functions import ExtractYear

Item.objects.annotate(
    created_at_year=ExtractYear("created_at"),
    modified_at_year=ExtractYear("modified_at"),
    deleted_at_year=ExtractYear("deleted_at"),
).values("created_at_year", "modified_at_year", "deleted_at_year").annotate(
    total_created=models.Count("id"),
    total_modified=models.Count("id"),
    total_deleted=models.Count("id"),
)

Example Data

id created_at modified_at deleted_at
1 2020-01-01 01:01:01 2020-01-01 01:01:01 2021-01-01 01:01:01
2 2020-01-01 01:01:01 2021-01-01 01:01:01 2021-01-01 01:01:01
3 2020-01-01 01:01:01 2021-01-01 01:01:01 2022-01-01 01:01:01

Desired Output

{
    2020: {"total_created": 3, "total_modified": 1, "total_deleted": 0},
    2021: {"total_created": 0, "total_modified": 2, "total_deleted": 2},
    2022: {"total_created": 0, "total_modified": 0, "total_deleted": 1},
}

I know it’s possible to add filter parameter to Count but I do not know if I can use that here.

Summary

This text will be hidden

I’m going to make a guess that this is going to take two queries. I’m not thinking of any way of building a query to perform the aggregations by year while also identifying the years to be aggregated.
So I think you can write one query to identify all the years across all the fields, and then a second query to get the counts.
(I’m still thinking on this. I don’t have a full solution.)

Thanks for looking at this Ken.
My current solution which is looping through every Item in my database is a performance bottleneck.

I am not very familiar with Raw SQL or Django Annotations, but if you have any additional input or resources on your two query approach it would be greatly appreciated.

I will (have additional input) - I just need to think on it a bit more.

Couple of background questions that may help create some boundaries around the solution.

  • How many rows are in your Item table?
  • How many years are you expecting to find?
  • How frequently are you expecting this to be run? (Frequently? (multiple times per day) Daily? Weekly? Monthly? )
    • What is your target run time for this calculation?
    • How much time is it taking now?
  • Are the instances of these items updated regularly?
    • Do those updates (potentially) change the created_at and deleted_at fields in addition to the modified_at field?

If this is a table consisting of millions of rows, there may not be a quick solution without doing other work to better structure your data. (e.g. Creating appropriate indexes or pseudo-indexes, or pre-generating and storing intermediate results)

But all these factors may affect the choice of a solution. The larger the table or the faster the results are required, the more work it’s going to be to satisfy those requirements. If the table’s large enough or if the result timings are strict enough, even a 1 or 2 query solution may not satisfy it without other work needing to be done.

This is a simplified abstraction of a real problem but I will try to address your questions as accurately as possible.

  • The Item table has close to a million rows going back to 2012
  • I used Year as an example but it actually will be grouped by Week for a Month at a time, so 4 weeks at a time (Week 1 through 4)
    • There is another variation of this same report that is broken by Day for a Month at a time, so 28 - 31 days at a time (Day 1 through 31)
    • For this timespan of a month there are ~8,000 matching rows
    • In the real problem I am also joining RelatedItem and counting RelatedItem by date as well
  • This report is ran frequently, 50+ times a day by different users
    • Target run time would be anything faster than the current implementation
    • The most frequent variation of this report takes ~30 seconds
  • In the real problem, the dates are changed frequently (including created_at and deleted_at)

Ultimately I am trying to treat this as a learning experience to re-learn SQL and how Django’s ORM works. As I learn more I realize there has to be a better way of using the database rather than looping through all Items one by one.

I rather have a maintainable/understandable solution than the most performant solution.

That all helps a lot - one more question for you.

Is the data being added / updated in the system being added through Django or does it come in and get added outside of Django?

From these descriptions, I don’t think you’re going to find any “pure SQL” or “pure Python” solution satisfactory. Having to extract date-component information from a field (e.g. Year, Month, Week) across a million rows is going to take time. In fact, I’m going to guess (with nothing other than my gut to back it up) that iterating over those rows takes less time than the time it’s taking for the database to send those rows to you - and possibly less time than what it takes for the database to extract the date components.

So I think you’re going to want to facilitate these queries by enhancing your model.

For example, you can create an index for each field for the individual date components. (see Model index reference | Django documentation | Django)

I’d try it, and then check an explain on a query to verify that the query is using the indexes rather than doing a table scan. (If it does use the index for getting the counts, then there’s a lot less need to try to do this in one or two queries. Doing this in 10 queries would still be (relatively) extremely fast.

If the expression indexes don’t help, then your next option would be to add columns to the model for those extracted components, and index and search on them. (The ease in which this could be done depends upon where the data and updates are coming from - that’s why I asked that question.)

If you don’t have sufficient control over that data, then you could look at a related statistics data. This would be a separate table with a one-to-one relationship with the base table, but having that extracted data. Again in this case, the onus is on you to ensure the two stay in sync.

In either of these last two cases, I would even go so far as to suggest that the work be done through triggers in the database layer rather than relying upon Django ensuring it gets done.

Yes, the application is a monolith and everything is done through Django.

I bet your assumption the database is spending most of the time returning those ~8,000 matching rows is correct, which is why I started investigating annotate.

Indexes on the dates are a good idea, I will add those and check the explain/performance on some of my queries to see if any improvements have been made.

Thank you for the tips.

this SQL works, I don’t know if this is optimized or not!

WITH created_year_count AS (
	SELECT 
		EXTRACT('year' FROM "sample_item"."created_at" AT TIME ZONE 'UTC') AS "year",
		COUNT("sample_item"."id") AS "created_count"
	FROM "sample_item" 
	GROUP BY 
		EXTRACT('year' FROM "sample_item"."created_at" AT TIME ZONE 'UTC')
), modified_year_count AS (
	SELECT 
		EXTRACT('year' FROM "sample_item"."modified_at" AT TIME ZONE 'UTC') AS "year",
		COUNT("sample_item"."id") AS "modified_count"
	FROM "sample_item" 
	GROUP BY 
		EXTRACT('year' FROM "sample_item"."modified_at" AT TIME ZONE 'UTC')
), deleted_year_count AS (
	SELECT 
		EXTRACT('year' FROM "sample_item"."deleted_at" AT TIME ZONE 'UTC') AS "year",
		COUNT("sample_item"."id") AS "deleted_count"
	FROM "sample_item" 
	GROUP BY 
		EXTRACT('year' FROM "sample_item"."deleted_at" AT TIME ZONE 'UTC')
)
SELECT cyc.year
,	cyc.created_count
,	myc.modified_count
,	dyc.deleted_count
FROM created_year_count as cyc
	LEFT OUTER JOIN modified_year_count as myc
	ON cyc.year = myc.year
	LEFT OUTER JOIN deleted_year_count as dyc
	ON cyc.year = dyc.year