Proposal: Add generate_series() support to contrib.postgres?

shangxiao · June 29, 2023, 1:01pm

Hi folks,

This one’s a long shot but thought I might as well throw the idea out there:

I often make use of Postgres’ generate_series() to generate reports aggregating data against a date range in buckets of intervals like hourly, weekly, monthly etc.

You can normally aggregate in intervals like hourly, weekly etc if you can transform timestamps to match the interval and then do group by on that transform but generate_series() improves this in 2 ways:

you can generate aggregates for intervals for which there’s no easy way to determine what interval the timestamp is in eg 5 min or fortnightly intervals (ie there’s no “date_trunc()” / “extract(‘week’)” equivalent)
you can aggregate densely stored data and still generate intervals for the gaps between your data. generate_series() is often used as a left/right join in order to generate the nulls/zeroes.

Unfortunately the only way to use generate_series() right now is with raw SQL. I don’t mind SQL but deviating from the standard queryset code has its disadvantages.

I’ve been playing around with the idea of creating a GenerateSeries expression that you can annotate to a queryset and just used it successfully to generate a 12 month report with aggregates per month. It made the code a lot more concise & easier to read.

Here’s what I came up with so far: https://github.com/shangxiao/stupid-django-tricks/tree/master/custom_joins

Basic usage is like so:

class Data(Model):
    timestamp = DateTimeField()
    data = IntegerField()

series = GenerateSeries(
    start=datetime(2000, 1, 1, tzinfo=utc),  # start and stop can also be expressions so that you can date_trunc() if need be!
    stop=datetime(2000, 1, 5, tzinfo=utc),
    step=timedelta(days=1),
    join_condition=Q(
        timestamp__gte=SeriesRef(),  # a ref is required in order to allow the join to refer back to the series and define the interval boundaries
        timestamp__lte=SeriesRef() + Value(timedelta(days=1)),
    ),
    alias="series",
)

dataset = (
    Data.objects.annotate(series)
    .values("series")
    .annotate(sum=Coalesce(Sum("data"), 0))
    .values("series", "sum")
    .order_by("series")
)

Anyone else think generate_series() would be useful to have?

Edit: I just want to point out the solution above isn’t perfect. To get the resultset to have types of timestamp with timezone you must select the series like so: "series"."timestamptz". I couldn’t see any way to do this without adding a small RawSQL annotation to alias result.

adamchainz · July 3, 2023, 10:21am

+1 This would indeed be useful. We should aim for a more general-purpose virtual table capability, to allow for use of other databases’ equivalents.

Matthew Schinckel wrote a post about a custom class to make generate_series work in 2019: Functions as Tables in Django and Postgres - Schinckel.net

You’ve come up with better syntax, but maybe it could be iterated on further, especially if we can modify the ORM.

shangxiao · July 10, 2023, 8:47am

Nice, well it’s good to see others also interested in supporting generate_series() - whether that’s generic or specific.

Generic support for virtual tables would be awesome but it does sounds like a big rabbit hole we could get into there. It’s easy enough to define a model backed by a virtual table but coming up with a way to join to models on-the-fly seems like the tricky part. There are probably a bunch of other problems I haven’t thought of yet too.

charettes · July 12, 2023, 2:52am

Hello David!

As Adam alluded at I think the bits that Django should focus on is providing the proper primitives for table-like and set returning expressions to be inserted and referenced through the ORM.

generate_series is an example of such expressions but unnest and the more commonly requested and database agnostic Subquery is another example of that.

If we put the problem of composite output fields aside for a moment (expressions that can return rows) I think that good first step would be to add an Expression flag that identify which expressions are set returning and use Subquery as the default one to validate the approach.

With this new metadata available we could deprecate FilteredRelation in favour of a more generic Relation expression that accepts either a resolvable relation name (like FilteredRelation does) or a set returning expression as it’s first argument.

It could then be used as

series = GenerateSeries(
    start=datetime(2000, 1, 1, tzinfo=utc),
    stop=datetime(2000, 1, 5, tzinfo=utc),
    step=timedelta(days=1)
)
Data.objects.alias(
    series=Relation(
        series,
        condition=Q(
            series__between=(...)
        )
    )
)

charettes · July 10, 2024, 12:56am

Trying to encourage discussions on a potential way to add support for set-returning (table-valued) functions to the ORM in this ticket is you’re still eager to see it land @shangxiao.

shangxiao · July 10, 2024, 7:03am

Thanks Simon

I’m just coming down off the “all hands on deck” mode I’ve been on over the past 2 months and am keen to revisit some of the things I’ve left off - this included (very timely that you pinged me).

One of the things I’ve recently wanted to add to Django is the ability to have LATERAL JOIN as a way to extend on the SubQuery() scalar subquery select capability - as lateral joins offer the ability to return entire records and not just scalars.

I’ll read through that ticket you linked this evening

Topic		Replies	Views
generate series & cross join Using the ORM	0	352	April 11, 2024
Django repeats the Postgres generate_series() function call in the WHERE clause Using the ORM	3	215	April 18, 2024
How to make 'generate_series' tzaware with django Using Django	15	2043	March 8, 2021
Generating counts of sales over date ranges with filtering Using the ORM	2	33	August 31, 2024
Trying to map a Postgres SQL Query to an ORM queryset Using the ORM	2	148	May 25, 2024

Proposal: Add generate_series() support to contrib.postgres?

Related topics