order_with_respect_to pointing to a null ForeignKey causes expensive query

I have a model that looks like this:

class CharacterSheetNode(models.Model):
    parent = models.ForeignKey('self', related_name='items', null=True, on_delete=models.CASCADE)

    # other fields

    class Meta:
        order_with_respect_to = 'parent'

I have 30M+ instances of this model in my database. My problem is that when I create a new instance, the following Django code is run:

site-packages\django\db\models\base.py, line 876:

if meta.order_with_respect_to:
    # If this is a model with an order_with_respect_to
    # autopopulate the _order field
    field = meta.order_with_respect_to
    filter_args = field.get_filter_kwargs_for_object(self)
    self._order = cls._base_manager.using(using).filter(**filter_args).aggregate(
                    _order__max=Coalesce(
                        ExpressionWrapper(Max('_order') + Value(1), output_field=IntegerField()),
                        Value(0),
                    ),
                )['_order__max']

This piece of code goes through all the CharacterSheetNode instances with no parent (there are probably 1M of them) and looks for the Max('_order') value, as if they were all children of the same parent, None.

The query itself takes 7 seconds to run.

How can I avoid running this code if the parent is null? Or at least tell Django to only set the order_value if there is a parent?

The whole method from which these lines were extracted:

def _save_table(self, raw=False, cls=None, force_insert=False,
                force_update=False, using=None, update_fields=None):
    """
    Do the heavy-lifting involved in saving. Update or insert the data
    for a single table.
    """
    meta = cls._meta
    non_pks = [f for f in meta.local_concrete_fields if not f.primary_key]

    if update_fields:
        non_pks = [f for f in non_pks
                   if f.name in update_fields or f.attname in update_fields]

    pk_val = self._get_pk_val(meta)
    if pk_val is None:
        pk_val = meta.pk.get_pk_value_on_save(self)
        setattr(self, meta.pk.attname, pk_val)
    pk_set = pk_val is not None
    if not pk_set and (force_update or update_fields):
        raise ValueError("Cannot force an update in save() with no primary key.")
    updated = False
    # Skip an UPDATE when adding an instance and primary key has a default.
    if (
        not raw and
        not force_insert and
        self._state.adding and
        meta.pk.default and
        meta.pk.default is not NOT_PROVIDED
    ):
        force_insert = True
    # If possible, try an UPDATE. If that doesn't update anything, do an INSERT.
    if pk_set and not force_insert:
        base_qs = cls._base_manager.using(using)
        values = [(f, None, (getattr(self, f.attname) if raw else f.pre_save(self, False)))
                  for f in non_pks]
        forced_update = update_fields or force_update
        updated = self._do_update(base_qs, using, pk_val, values, update_fields,
                                  forced_update)
        if force_update and not updated:
            raise DatabaseError("Forced update did not affect any rows.")
        if update_fields and not updated:
            raise DatabaseError("Save with update_fields did not affect any rows.")
    if not updated:
        if meta.order_with_respect_to:
            # If this is a model with an order_with_respect_to
            # autopopulate the _order field
            field = meta.order_with_respect_to
            filter_args = field.get_filter_kwargs_for_object(self)
            self._order = cls._base_manager.using(using).filter(**filter_args).aggregate(
                _order__max=Coalesce(
                    ExpressionWrapper(Max('_order') + Value(1), output_field=IntegerField()),
                    Value(0),
                ),
            )['_order__max']
        fields = meta.local_concrete_fields
        if not pk_set:
            fields = [f for f in fields if f is not meta.auto_field]

        returning_fields = meta.db_returning_fields
        results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
        if results:
            for value, field in zip(results[0], returning_fields):
                setattr(self, field.attname, value)
    return updated

I am using Django 3.1

I'm not following exactly what you're trying to accomplish here, but I will toss out a couple of ideas that might be of value based upon my current understanding.

Which is trying to accomplish this: go through all the CharacterSheetNode rows with no parent and find the Max('_order') value.

Seems to me that this objective, as written, would be more quickly retrieved as:

cls._base_manager.using(using).filter(parent=None).aggregate(Max('_order'))

And, if that's still too slow, I'd look at creating a database index on the (parent, _order) field pair. That should greatly improve the performance of that query.
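For reference, that index could be declared on the model itself, roughly like this. This is only a sketch: I haven't verified that the auto-generated _order field can be referenced by name in Meta.indexes, and the index name is just illustrative; if the system check rejects it, the same index can be created with raw SQL in a migration.

from django.db import models

class CharacterSheetNode(models.Model):
    parent = models.ForeignKey('self', related_name='items', null=True, on_delete=models.CASCADE)

    class Meta:
        order_with_respect_to = 'parent'
        indexes = [
            # Composite index covering the parent filter and the MAX(_order)
            # lookup that Django issues on save.
            models.Index(fields=['parent', '_order'], name='sheet_parent_order_idx'),
        ]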

BUT

If you're using this to get the next value of a sequence, this entire path of logic is subject to a race condition where two nearly concurrent requests can end up getting the same "next value".

You need to carefully consider whether you really need these values to be in strict sequential ascending order, or whether "gaps" in the sequence are acceptable. If you can live with potential gaps, this becomes a much easier problem to solve.

You could define _order as an auto-increment field, allowing the database to assign the proper next value for you - that ensures that you never have a duplicate but leaves open the possibility of gaps.

Or, if you don't want to have your sequence incremented on every insert, you could create your own sequence object and pull from that under the conditions in which you want it to be used. (This can still create gaps in the numbering under some edge-case conditions.)

The only way to ensure that there are no gaps and no duplicates involves wrapping the retrieval of the next value to use in a transaction along with the insertion of the row containing that value, holding a complete lock on the table during this process. This may or may not be acceptable depending upon the frequency of inserts.
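To illustrate that last option, here is a rough sketch using Django's transaction machinery. It assumes you manage an explicit integer field yourself (called position here purely for illustration) instead of relying on the auto-populated _order, and it locks the existing sibling rows with select_for_update(). Note it does not fully serialize the case where no sibling exists yet; covering that strictly would need a database-specific table lock.

from django.db import transaction

from myapp.models import CharacterSheetNode  # assumed module path


def create_child(parent, **fields):
    with transaction.atomic():
        # Lock the existing siblings so concurrent inserts under the same
        # parent are serialized until this transaction commits.
        positions = list(
            CharacterSheetNode.objects
            .select_for_update()
            .filter(parent=parent)
            .values_list('position', flat=True)
        )
        next_position = (max(positions) + 1) if positions else 0
        return CharacterSheetNode.objects.create(
            parent=parent, position=next_position, **fields
        )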

Ken

Wait, just to be clear, this is not my code :sweat_smile:. It's Django's code, from site-packages\django\db\models\base.py, line 876.

Only the first block, defining CharacterSheetNode is mine.
The rest is the code that Django runs when I simply create a new instance of CharacterSheetNode.

This is not my objective. This is what Django currently does and that I would like it to not do.

OK, so let's start from the other direction - what is the code you've written that is causing this to happen?

This would include showing the complete model and view involved in the process where a CharacterSheetNode is created and saved, including any ancillary utility functions that may be called.

I reach this point by calling CharacterSheetNode.objects.bulk_create.

The Django method _save_table is independent of my code, though. The problematic part is called as soon as you have an order_with_respect_to value set in the model's Meta.

My problem is that Django considers all the instances with a None order_with_respect_to object as being ordered with respect to the same object, which makes the Max('_order') call extremely slow when you have a lot of them.

I don't know if I am being clear enough, so here is an example (the equivalent ORM calls are sketched after the list):

  • When I add a Node to a Parent Node that has 15 children. The Django method will call Max('_order') on the 15 children to find the _order value of the new node.
  • When I add a Node without a Parent Node. The Django method will call Max('_order') on the 10M+ Nodes that also have no Parent Node to return the _order value.
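In ORM terms, the two cases boil down to something like this (just a sketch; parent_node stands in for the node that has 15 children):

from django.db.models import Max

from myapp.models import CharacterSheetNode  # assumed module path

parent_node = CharacterSheetNode.objects.first()  # any node with children, for illustration

# Case 1: only the 15 siblings under parent_node are scanned.
CharacterSheetNode.objects.filter(parent=parent_node).aggregate(Max('_order'))

# Case 2: every one of the 10M+ root nodes with no parent is scanned.
CharacterSheetNode.objects.filter(parent=None).aggregate(Max('_order'))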

For now I have fixed it by editing the Django code with a hack:

if meta.order_with_respect_to:
    # If this is a model with an order_with_respect_to
    # autopopulate the _order field
    field = meta.order_with_respect_to
    filter_args = field.get_filter_kwargs_for_object(self)

    # HACK: skip the expensive aggregate and default _order to 0 when the
    # order_with_respect_to key (the parent) is None
    value_iterator = iter(filter_args.values())
    first_value = next(value_iterator)

    if first_value is None:
        self._order = 0
    else:
        self._order = cls._base_manager.using(using).filter(**filter_args).aggregate(
            _order__max=Coalesce(
                ExpressionWrapper(Max('_order') + Value(1), output_field=IntegerField()),
                Value(0),
            ),
        )['_order__max']

But I would like to achieve that without a hack :slight_smile:.

Maybe it is an oversight from the Django dev team?
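In the meantime, the least hacky workaround I can think of is to drop order_with_respect_to entirely and maintain an explicit ordering field myself, computing it only when there is a parent. A rough, untested sketch (the field name and index are just illustrative):

from django.db import models
from django.db.models import Max


class CharacterSheetNode(models.Model):
    parent = models.ForeignKey('self', related_name='items', null=True, on_delete=models.CASCADE)
    # Explicit ordering field replacing the auto-generated _order.
    order = models.PositiveIntegerField(default=0)

    class Meta:
        ordering = ['order']
        indexes = [models.Index(fields=['parent', 'order'], name='sheet_parent_order_idx')]

    def save(self, *args, **kwargs):
        # Only compute the next value for nodes that actually have a parent;
        # root nodes keep order=0, so the huge MAX() over every parentless
        # row never runs.
        if self._state.adding and self.parent_id is not None:
            current_max = (
                CharacterSheetNode.objects
                .filter(parent_id=self.parent_id)
                .aggregate(m=Max('order'))['m']
            )
            self.order = 0 if current_max is None else current_max + 1
        super().save(*args, **kwargs)

One caveat: bulk_create() does not call save(), so for bulk inserts the order values would still have to be assigned on the instances beforehand.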


I'm not sure it's an "oversight" as much as it's a side effect of what order_with_respect_to attempts to handle.

But I would actually go back to my first thought - if, as a result of applying the order_with_respect_to clause, Django doesn't create an index on the (foreign key, _order) tuple, I'd be tempted to create one manually and see if that significantly improves performance. (I would hope that the Max query sent to the backend database would take advantage of it - but I would want to verify that via an explain and/or explain analyze on the actual query being issued.)
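One way to run that check from Django itself (a sketch, assuming a PostgreSQL backend and that the model lives in myapp): capture the SQL Django actually issues for the aggregate, then feed it back to the database with EXPLAIN ANALYZE.

from django.db import connection
from django.db.models import Max
from django.test.utils import CaptureQueriesContext

from myapp.models import CharacterSheetNode  # assumed module path

# Capture the exact SQL of the slow aggregate.
with CaptureQueriesContext(connection) as ctx:
    CharacterSheetNode.objects.filter(parent=None).aggregate(Max('_order'))

sql = ctx.captured_queries[-1]['sql']
print(sql)

# Ask the database how it executes that query, to confirm whether the
# (parent_id, _order) index is actually being used.
with connection.cursor() as cursor:
    cursor.execute('EXPLAIN ANALYZE ' + sql)
    for (line,) in cursor.fetchall():
        print(line)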

If such an index has been created and doesn't help, I'd be tempted to go back and take another look at my requirements to see if there's a different way to handle this.

Just checking in a few months later: was this issue ever looked at, by any chance?