Hi folks,
This one’s a long shot but thought I might as well throw the idea out there:
I often make use of Postgres’ generate_series() to generate reports aggregating data against a date range in buckets of intervals like hourly, weekly, monthly etc.
You can normally aggregate in intervals like hourly, weekly etc if you can transform timestamps to match the interval and then do group by on that transform but generate_series() improves this in 2 ways:
- you can generate aggregates for intervals for which there’s no easy way to determine what interval the timestamp is in eg 5 min or fortnightly intervals (ie there’s no “date_trunc()” / “extract(‘week’)” equivalent)
- you can aggregate densely stored data and still generate intervals for the gaps between your data. generate_series() is often used as a left/right join in order to generate the nulls/zeroes.
Unfortunately the only way to use generate_series() right now is with raw SQL. I don’t mind SQL but deviating from the standard queryset code has its disadvantages.
I’ve been playing around with the idea of creating a GenerateSeries expression that you can annotate to a queryset and just used it successfully to generate a 12 month report with aggregates per month. It made the code a lot more concise & easier to read.
Here’s what I came up with so far: https://github.com/shangxiao/stupid-django-tricks/tree/master/custom_joins
Basic usage is like so:
class Data(Model):
timestamp = DateTimeField()
data = IntegerField()
series = GenerateSeries(
start=datetime(2000, 1, 1, tzinfo=utc), # start and stop can also be expressions so that you can date_trunc() if need be!
stop=datetime(2000, 1, 5, tzinfo=utc),
step=timedelta(days=1),
join_condition=Q(
timestamp__gte=SeriesRef(), # a ref is required in order to allow the join to refer back to the series and define the interval boundaries
timestamp__lte=SeriesRef() + Value(timedelta(days=1)),
),
alias="series",
)
dataset = (
Data.objects.annotate(series)
.values("series")
.annotate(sum=Coalesce(Sum("data"), 0))
.values("series", "sum")
.order_by("series")
)
Anyone else think generate_series() would be useful to have?
Edit: I just want to point out the solution above isn’t perfect. To get the resultset to have types of timestamp with timezone
you must select the series like so: "series"."timestamptz"
. I couldn’t see any way to do this without adding a small RawSQL annotation to alias result.