Unexpected deadlock across two separate rows, using Postgres 17 and select_for_update().

Hi,

I have a table of Endpoints which looks like this:

class Endpoint(db_models.Model):
    ...
    op = db_models.ForeignKey(ReportOp, on_delete=db_models.CASCADE)
    client = db_models.ForeignKey(Client, on_delete=db_models.CASCADE, null=True, blank=True)
    client_private = db_models.IntegerField(_('Client id or zero'), blank=True)
    netloc = db_models.URLField(_('URL'), help_text=_('...'))
    ...

    ... various data fields...

    ...

    class Meta:
        unique_together = ('op', 'netloc', 'client_private')

I modify one row at a time in this table from separate Linux gunicorn/django processes, using a single point in the code, which looks like this:

def locked_instance(self, ...):

    # Ensure the row exists.

    with transaction.atomic():
        instance, created = Endpoint.objects.get_or_create(kwargs | {'client': client}, **kwargs)
        ...initialise the various data fields...

    # Return a locked reference to the row.

    qs = Endpoint.objects.select_for_update().filter(id=instance.id)
    return qs

To my surprise, I got a deadlock where the Postgres server log shows two DIFFERENT rows (id = 1 and id = 2 below) being involved. Here is the server log:

ERROR: deadlock detected
DETAIL: Process 15576 waits for ShareLock on transaction 31053599; blocked by process 16953.
Process 16953 waits for ShareLock on transaction 31053597; blocked by process 15576.
Process 15576: SELECT "paiyroll_endpoint"."id", "paiyroll_endpoint"."op_id", "paiyroll_endpoint"."client_id", "paiyroll_endpoint"."client_private", "paiyroll_endpoint"."netloc", "paiyroll_endpoint"."calls", "paiyroll_endpoint"."ms", "paiyroll_endpoint"."history", "paiyroll_endpoint"."current_history" FROM "paiyroll_endpoint" WHERE "paiyroll_endpoint"."id" = 1 FOR UPDATE
Process 16953: SELECT "paiyroll_endpoint"."id", "paiyroll_endpoint"."op_id", "paiyroll_endpoint"."client_id", "paiyroll_endpoint"."client_private", "paiyroll_endpoint"."netloc", "paiyroll_endpoint"."calls", "paiyroll_endpoint"."ms", "paiyroll_endpoint"."history", "paiyroll_endpoint"."current_history" FROM "paiyroll_endpoint" WHERE "paiyroll_endpoint"."id" = 2 FOR UPDATE
HINT: See server log for query details.
CONTEXT: while locking tuple (7,15) in relation "paiyroll_endpoint"
STATEMENT: SELECT "paiyroll_endpoint"."id", "paiyroll_endpoint"."op_id", "paiyroll_endpoint"."client_id", "paiyroll_endpoint"."client_private", "paiyroll_endpoint"."netloc", "paiyroll_endpoint"."calls", "paiyroll_endpoint"."ms", "paiyroll_endpoint"."history", "paiyroll_endpoint"."current_history" FROM "paiyroll_endpoint" WHERE "paiyroll_endpoint"."id" = 1 FOR UPDATE

How can there be a deadlock between two separate rows? I assume this means there is a problem with my understanding and/or my code, but I'm not sure what that could be. I have looked at the select_for_update(no_key=True) docs and the Postgres row-lock docs, but I confess to being somewhat out of my depth. Any pointers would be most welcome…

Thanks, Shaheed

Hi! How are you using this method? The returned queryset must be used within a transaction where you’ll need to handle the various race conditions. The docs try to point this out here: QuerySet API reference | Django documentation | Django

@CodenameTim Thanks for the reply. The caller looks like this:

        qs = row.locked_instance()
        with transaction.atomic():
            for db_row in qs.all():
                ...moderate amount of compute bound stuff...
                db_row.save()

I believe that is what the docs call for.

In any event, in the failure that I caught in the logs, there were two different rows involved so I suspect I have missed some other bigger truth such that, somehow, row-level locking is not being used.

How can there be a deadlock between two separate rows?

Deadlock logs may seem strange at first sight, because they are written at the moment the deadlock is detected. What your log says is: process 15576 is trying to lock id=1, which process 16953 already holds, while process 16953 is trying to lock id=2, which 15576 already holds.

When using locks it is very important to control the order in which your rows are locked: if that order is not enforced, unpredictable deadlocks can occur, because each process grabs whichever row is still unlocked and then blocks on a row already locked by the other.

If you add an order_by to your queries this should not happen, because each process will fetch (and lock) rows in the same order and will simply wait for the other to release its locks.
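The lock-ordering point can be illustrated outside the database with a toy Python threading model (not Django code; plain locks stand in for row locks). Both threads ask for the same two "rows" in different orders, but because each one sorts the ids before acquiring, they take the locks in a consistent order and cannot deadlock:

```python
import threading

# Toy model: two "rows" represented as plain locks.
rows = {1: threading.Lock(), 2: threading.Lock()}

def update_rows(ids, results):
    # Sort the ids so every thread acquires the locks in the same order.
    acquired = sorted(ids)
    for row_id in acquired:
        rows[row_id].acquire()
    try:
        results.append(ids)  # "do the work"
    finally:
        for row_id in reversed(acquired):
            rows[row_id].release()

results = []
# The two threads request the rows in opposite orders...
t1 = threading.Thread(target=update_rows, args=([1, 2], results))
t2 = threading.Thread(target=update_rows, args=([2, 1], results))
t1.start(); t2.start()
t1.join(); t2.join()
# ...yet both complete, because the acquisition order is normalised.
print(len(results))  # 2
```

In Django terms, the suggestion above would amount to something like `Endpoint.objects.select_for_update().order_by('pk')` whenever a locking queryset can match more than one row.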

Do you have logs that show both of the locks?

@sevdog ok, but the locked queryset is defined to only contain a single row .filter(id=instance.id), so I don’t see why ordering would make a difference in my case?

@CodenameTim I have the client side error log as well as the Postgres server side error log posted above. But that seemed less use than the server log. Is there something more specific you had in mind (that I would need to enable)? FWIW, the production setup is on AWS RDS but I can try to repro locally if needed.

@ShaheedHaque from the snippet you have posted you are right, it should not cause a deadlock. Is there an outer transaction, or a loop in which your code invokes the above snippets?

Do you use ATOMIC_REQUESTS?
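For reference, ATOMIC_REQUESTS is a per-database setting in settings.py that wraps each HTTP view in a transaction (a sketch; the database name is a placeholder):

```python
# settings.py (illustrative)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',           # placeholder
        'ATOMIC_REQUESTS': True,  # wrap every request in a transaction
    },
}
```

Note that this only affects view code; code run from Celery tasks is not wrapped by it.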


You are handling the circular wait correctly, but I think your two transactions are locking rows in different orders, and each is waiting for the other to release its lock.

Thanks for all the replies…

@sevdog No, there is no outer transaction. You also asked if ATOMIC_REQUESTS was set; I was about to say "yes", but when I read up on ATOMIC_REQUESTS it turned out I do not have it set. In any case, this code is typically executed from a Celery context rather than a view context.

(Aside: I have spent the last 7 years labouring under the misapprehension that Django always wrapped views in transactions. Gulp. Some serious introspection is going to be needed).

Is there reason to think the lack of ATOMIC_REQUESTS might be a contributing factor?

@shubh-gitpush I don’t think this is correct: please see my previous reply about there only being one row in each query.

Perhaps I need to ask the Postgres folk?

Yes, there is one row in each query, but a deadlock can still happen when two transactions run at the same time. I think there is another query inside the same transaction which requests a row that is already locked by the other transaction.

Maybe the error log only shows the query on which the deadlock was detected, not the earlier query that already locked the other row. It is possible that each transaction had already locked another row earlier. I am not sure, but you could check that, and also ask in the Postgres community. I hope you find a solution.

Following up here: