Proposal: Add generate_series() support to contrib.postgres?

Hi folks,

This one’s a long shot but thought I might as well throw the idea out there:

I often make use of Postgres’ generate_series() to generate reports aggregating data against a date range in buckets of intervals like hourly, weekly, monthly etc.

You can normally aggregate in intervals like hourly, weekly etc if you can transform timestamps to match the interval and then do group by on that transform but generate_series() improves this in 2 ways:

  • you can generate aggregates for intervals for which there’s no easy way to determine what interval the timestamp is in eg 5 min or fortnightly intervals (ie there’s no “date_trunc()” / “extract(‘week’)” equivalent)
  • you can aggregate densely stored data and still generate intervals for the gaps between your data. generate_series() is often used as a left/right join in order to generate the nulls/zeroes.

Unfortunately the only way to use generate_series() right now is with raw SQL. I don’t mind SQL but deviating from the standard queryset code has its disadvantages.

I’ve been playing around with the idea of creating a GenerateSeries expression that you can annotate to a queryset and just used it successfully to generate a 12 month report with aggregates per month. It made the code a lot more concise & easier to read.

Here’s what I came up with so far: https://github.com/shangxiao/stupid-django-tricks/tree/master/custom_joins

Basic usage is like so:

class Data(Model):
    timestamp = DateTimeField()
    data = IntegerField()

series = GenerateSeries(
    start=datetime(2000, 1, 1, tzinfo=utc),  # start and stop can also be expressions so that you can date_trunc() if need be!
    stop=datetime(2000, 1, 5, tzinfo=utc),
    step=timedelta(days=1),
    join_condition=Q(
        timestamp__gte=SeriesRef(),  # a ref is required in order to allow the join to refer back to the series and define the interval boundaries
        timestamp__lte=SeriesRef() + Value(timedelta(days=1)),
    ),
    alias="series",
)

dataset = (
    Data.objects.annotate(series)
    .values("series")
    .annotate(sum=Coalesce(Sum("data"), 0))
    .values("series", "sum")
    .order_by("series")
)

Anyone else think generate_series() would be useful to have?

Edit: I just want to point out the solution above isn’t perfect. To get the resultset to have types of timestamp with timezone you must select the series like so: "series"."timestamptz". I couldn’t see any way to do this without adding a small RawSQL annotation to alias result.

+1 This would indeed be useful. We should aim for a more general-purpose virtual table capability, to allow for use of other databases’ equivalents.

Matthew Schinckel wrote a post about a custom class to make generate_series work in 2019: Functions as Tables in Django and Postgres - Schinckel.net

You’ve come up with better syntax, but maybe it could be iterated on further, especially if we can modify the ORM.

1 Like

Nice, well it’s good to see others also interested in supporting generate_series() - whether that’s generic or specific.

Generic support for virtual tables would be awesome but it does sounds like a big rabbit hole we could get into there. It’s easy enough to define a model backed by a virtual table but coming up with a way to join to models on-the-fly seems like the tricky part. There are probably a bunch of other problems I haven’t thought of yet too.

Hello David!

As Adam alluded at I think the bits that Django should focus on is providing the proper primitives for table-like and set returning expressions to be inserted and referenced through the ORM.

generate_series is an example of such expressions but unnest and the more commonly requested and database agnostic Subquery is another example of that.

If we put the problem of composite output fields aside for a moment (expressions that can return rows) I think that good first step would be to add an Expression flag that identify which expressions are set returning and use Subquery as the default one to validate the approach.

With this new metadata available we could deprecate FilteredRelation in favour of a more generic Relation expression that accepts either a resolvable relation name (like FilteredRelation does) or a set returning expression as it’s first argument.

It could then be used as

series = GenerateSeries(
    start=datetime(2000, 1, 1, tzinfo=utc),
    stop=datetime(2000, 1, 5, tzinfo=utc),
    step=timedelta(days=1)
)
Data.objects.alias(
    series=Relation(
        series,
        condition=Q(
            series__between=(...)
        )
    )
)
2 Likes