close_if_unusable_or_obsolete fails to close unusable connections

This is to revisit an old issue from the tracker: #30646, "close_if_unusable_or_obsolete fails to close unusable connections".

I recently came across this on a production Celery worker that uses the soft_time_limit feature. Celery uses the Python signal module to hijack control of a task and raise an exception to stop runaway tasks.

This can be reproduced without Celery though:

  1. Create a view at / in a Django application using a postgres database with CONN_MAX_AGE configured to be greater than the duration of the experiment (say ~60)
import signal
import threading

from django.db import connection
from django.http import HttpResponse

def interruption_handler(*args):
    raise Exception("Interruption")

signal.signal(signal.SIGALRM, interruption_handler)

def test_view(request):
    def hijack_control_flow():
        # fire SIGALRM in 2 seconds, midway through the 5-second query
        signal.alarm(2)

    thread = threading.Thread(target=hijack_control_flow)
    thread.start()

    with connection.cursor() as cursor:
        # this query will be interrupted by interruption_handler
        cursor.execute("SELECT pg_sleep(5);")

    return HttpResponse("done")
  2. Run the application using gunicorn --error-logfile /dev/stdout -w1 --bind localhost:8000 testapp.wsgi
  3. Run curl http://localhost:8000 and note the regular Exception: Interruption
  4. Run curl http://localhost:8000 again and note django.db.utils.OperationalError: sending query failed: another command is already in progress

The second request will never be able to use the database because it’s attempting to use an unusable connection.
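To illustrate why close_if_unusable_or_obsolete misses this, here is a toy model of the bookkeeping as I understand it (a simplified sketch, not Django's actual code): the plain Exception raised by the signal handler is not classified as a database error, so errors_occurred is never set and the broken connection is handed to the next request.

```python
class ToyConnection:
    """Toy stand-in for Django's BaseDatabaseWrapper bookkeeping."""

    def __init__(self):
        self.errors_occurred = False
        self.closed = False

    def close(self):
        self.closed = True

    def close_if_unusable_or_obsolete(self):
        # Django only closes here when errors_occurred was set (or the
        # connection is obsolete); it never probes the socket itself.
        if self.errors_occurred:
            self.close()


conn = ToyConnection()

# Simulate the interrupted query: the signal handler raises a plain
# Exception, which the error-wrapping machinery does not recognize as a
# database error, so errors_occurred stays False.
try:
    raise Exception("Interruption")
except Exception:
    pass

conn.close_if_unusable_or_obsolete()
print(conn.closed)  # False: the broken connection survives into the next request
```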

I see 3 paths forward:

  1. Keep this as wontfix and have downstreams deal with it, perhaps pushing a PR to Celery to stop trying to reuse connections between tasks.
  2. Fully check whether the connection is usable on each request (see django/django PR #11573, "Fixed #30646 -- Always close unusable connections").
  3. Modify close_if_unusable_or_obsolete to accept an additional argument that checks usability more rigorously.

Option 2 seems too expensive for such an edge case, and would possibly still allow the bug to exist in the case of a database failover, as reported in #30646. I'm leaning towards option 1. Does anyone have additional thoughts?
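For concreteness, option 3 might look something like the sketch below. The require_usable flag is my invention, not a real Django parameter; backends do expose is_usable(), which typically round-trips a trivial query.

```python
def close_if_unusable_or_obsolete(conn, require_usable=False):
    # Close on the existing condition (errors were recorded), or, when the
    # caller opts in, when a liveness probe fails. (Hypothetical signature.)
    if conn.errors_occurred or (require_usable and not conn.is_usable()):
        conn.close()


class BrokenConnection:
    """Stub: a dead connection whose errors_occurred flag was never set."""

    def __init__(self):
        self.errors_occurred = False
        self.closed = False

    def is_usable(self):
        return False  # e.g. "another command is already in progress"

    def close(self):
        self.closed = True


conn = BrokenConnection()
close_if_unusable_or_obsolete(conn)                       # default: kept
was_closed_by_default = conn.closed
close_if_unusable_or_obsolete(conn, require_usable=True)  # opt-in probe closes it
print(was_closed_by_default, conn.closed)
```

Callers that know they just survived an interruption (like Celery after a soft timeout) could opt into the probe without taxing every request.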

To close the loop on this, I decided to go with reverting Celery's use of close_if_unusable_or_obsolete: see celery/celery PR #9824, "Revert 'Use Django DB max age connection setting'".

I'm not convinced reverting was the right call here, as it will likely leave workers that have been idle for a while in an even worse state.

Are you planning to follow up with option 1 and make Celery stop trying to reuse connections between tasks?

I think the crux of the issue is that we likely want to change how DatabaseErrorWrapper.__exit__ sets errors_occurred = True: from an allow list to a deny list. That is, if an unexpected error happened, the connection is most likely unusable.
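A toy sketch of that allow-list-to-deny-list flip (this is not Django's actual DatabaseErrorWrapper; the SAFE_EXCEPTIONS contents are hypothetical): any unexpected in-flight exception marks the connection broken, instead of only recognized database errors.

```python
SAFE_EXCEPTIONS = (KeyboardInterrupt,)  # hypothetical "known harmless" deny list


class DenyListErrorWrapper:
    """Toy wrapper: assume ANY unexpected exception broke the connection."""

    def __init__(self, connection):
        self.connection = connection

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is not None and not issubclass(exc_type, SAFE_EXCEPTIONS):
            # Unexpected exception mid-query: mark the connection unusable
            # so close_if_unusable_or_obsolete actually closes it later.
            self.connection.errors_occurred = True
        return False  # never swallow the exception


class ToyConnection:
    errors_occurred = False


conn = ToyConnection()
try:
    with DenyListErrorWrapper(conn):
        raise Exception("Interruption")  # e.g. raised by a signal handler
except Exception:
    pass

print(conn.errors_occurred)
```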

FWIW, we ran into a similar issue with MySQL at work, and we ended up monkey-patching the cursor wrapper so that SoftTimeLimitExceeded handling immediately closes the connection, since the C and Python layers got out of sync when a signal was raised mid-query.

We felt this was a better approach because the task might still have to perform cleanup work involving the database in the window between the soft and hard time limits; if the connection is unusable but not marked as such, that cleanup will crash even before the next task is scheduled.
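The cursor-wrapping approach described above could be sketched roughly like this. SoftTimeLimitExceeded is Celery's exception; a stand-in class is defined here so the snippet is self-contained, and the guard_execute wiring is my invention, not the actual patch.

```python
class SoftTimeLimitExceeded(Exception):
    """Stand-in for celery.exceptions.SoftTimeLimitExceeded."""


def guard_execute(cursor, connection):
    """Wrap cursor.execute so a soft timeout mid-query closes the connection."""
    original = cursor.execute

    def execute(sql, params=None):
        try:
            return original(sql, params)
        except SoftTimeLimitExceeded:
            # The driver's C and Python layers are out of sync; drop the
            # connection now so later cleanup work reconnects cleanly
            # instead of crashing on a half-dead connection.
            connection.close()
            raise

    cursor.execute = execute
    return cursor


# Demo with toy objects standing in for a real cursor and connection.
class ToyConnection:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ToyCursor:
    def execute(self, sql, params=None):
        raise SoftTimeLimitExceeded()


conn = ToyConnection()
cur = guard_execute(ToyCursor(), conn)
try:
    cur.execute("SELECT pg_sleep(5);")
except SoftTimeLimitExceeded:
    pass
print(conn.closed)  # True: the broken connection was closed eagerly
```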