close_if_unusable_or_obsolete fails to close unusable connections

This is to revisit an old ticket from the tracker: #30646 (close_if_unusable_or_obsolete fails to close unusable connections).

I recently came across this on a production Celery worker that uses the soft_time_limit feature. Celery uses the Python signal module to hijack control of a task and raise an exception (SoftTimeLimitExceeded) to stop runaway tasks.
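
For concreteness, this is roughly what the Celery side looks like (an illustrative task, not code from the affected worker):

# Illustrative only: a task whose query gets interrupted by the soft time limit.
from celery import shared_task
from django.db import connection

@shared_task(soft_time_limit=2)
def runaway_task():
    with connection.cursor() as cursor:
        # After ~2 seconds Celery's signal handler raises SoftTimeLimitExceeded
        # here, mid-execute, leaving the connection in an inconsistent state.
        cursor.execute("SELECT pg_sleep(60);")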

This can be reproduced without Celery though:

  1. Create a view at / in a Django application that uses a Postgres database with CONN_MAX_AGE set longer than the duration of the experiment (say 60 seconds); a settings sketch follows this list:
import signal
import threading

from django.db import connection
from django.http import HttpResponse

def interruption_handler(*args):
    raise Exception("Interruption")

# Registered at import time, which happens in the gunicorn worker's main thread.
signal.signal(signal.SIGALRM, interruption_handler)

def test_view(request):
    def hijack_control_flow():
        # Schedule SIGALRM in 2 seconds; the handler runs in the main thread,
        # interrupting whatever the request below is doing at that moment.
        signal.alarm(2)

    thread = threading.Thread(target=hijack_control_flow)
    thread.start()

    with connection.cursor() as cursor:
        # this query will be interrupted by interruption_handler
        cursor.execute("SELECT pg_sleep(5);")

    return HttpResponse("done")  # only reached if the query is not interrupted
  2. Run the application using gunicorn --error-logfile /dev/stdout -w1 --bind localhost:8000 testapp.wsgi
  3. Run curl http://localhost:8000 and note the regular Exception: Interruption
  4. Run curl http://localhost:8000 again and note django.db.utils.OperationalError: sending query failed: another command is already in progress
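
For step 1, the relevant DATABASES entry would look roughly like this (a sketch; the database name and credentials are placeholders):

# settings.py (sketch; NAME/USER/PASSWORD are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "testdb",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "HOST": "localhost",
        "CONN_MAX_AGE": 60,  # longer than the experiment, so the connection persists across requests
    }
}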

The second request is never able to use the database because it is handed the previous request's persistent connection, which is now unusable, and close_if_unusable_or_obsolete did not detect that at the end of the first request.
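
A rough way to see the state the worker is left in (a sketch, run on the same worker process after the interrupted request, e.g. from a second view; behaviour of these attributes as I understand it):

from django.db import connection

print(connection.errors_occurred)      # False: the plain Exception never set the flag
print(connection.is_usable())          # False: the protocol state is actually broken
connection.close_if_unusable_or_obsolete()
print(connection.connection is None)   # False: the broken connection was kept open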

I see 3 paths forward:

  1. Keep this as wontfix and have downstreams deal with it, perhaps pushing a PR to Celery to stop trying to reuse connections between tasks.
  2. Fully check if the connection is usable on every request (see django/django PR #11573, "Fixed #30646 -- Always close unusable connections").
  3. Modify close_if_unusable_or_obsolete to take an additional argument to more rigorously check for usability (a downstream approximation is sketched after this list).
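
Until something like that exists in Django, a downstream could approximate option 3 with a helper along these lines (the function name is mine; is_usable() does a real round trip to the database, which is what makes this more expensive than the current flag-based check):

from django.db import connections

def close_unusable_connections():
    # Approximation of option 3: actually probe each open connection instead
    # of trusting the errors_occurred flag, and close the ones that fail.
    for conn in connections.all():
        if conn.connection is not None and not conn.is_usable():
            conn.close()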

Option 2 seems too expensive for such an edge case, and it would possibly still allow the bug to exist in the case of a database failover, as reported in #30646. I'm leaning towards the first option. Does anyone have additional thoughts?

To close the loop on this, I decided to go with reverting Celery's use of close_if_unusable_or_obsolete: celery/celery PR #9824 ("Revert 'Use Django DB max age connection setting'").

I’m not convinced reverting was the right call here, as it will likely cause workers that have been idle for a little while to be in an even worse state?

Are you planning to follow up on option 1 and make Celery stop trying to reuse connections between tasks?

I think the crux of the issue is that we likely want to change how DatabaseErrorWrapper.__exit__ sets errors_occurred = True from an allow list to a deny list; that is, if an unexpected error happened, the connection is most likely unusable.
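
In other words, something shaped like this (a sketch of the decision only, not Django's actual __exit__; the helper name is mine):

from django.db import DataError, IntegrityError

# Errors after which the connection is known to remain usable; anything else,
# including a plain Exception raised by a signal handler mid-query, would mark
# the connection so it gets health-checked at the end of the request or task.
SAFE_EXCEPTIONS = (DataError, IntegrityError)

def should_mark_errors_occurred(exc_type):
    return exc_type is not None and not issubclass(exc_type, SAFE_EXCEPTIONS)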

FWIW, we ran into a similar issue with MySQL at work, and we ended up shipping a monkey patch that wraps cursor execution so that a SoftTimeLimitExceeded immediately closes the connection, because the C and Python layers got out of sync when a signal was raised mid query execution.

We felt this was a better approach: the task might still have to perform cleanup work involving the database in the window between the soft and hard timeouts, and if the connection is unusable but not marked accordingly, that cleanup will crash even before the next task is scheduled.
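
Not our actual patch, but the shape of it looks roughly like this (assuming Django's django.db.backends.utils.CursorWrapper; error handling around close() omitted):

from celery.exceptions import SoftTimeLimitExceeded
from django.db.backends import utils as db_utils

_original_execute = db_utils.CursorWrapper.execute

def _execute_closing_on_soft_timeout(self, sql, params=None):
    try:
        return _original_execute(self, sql, params)
    except SoftTimeLimitExceeded:
        # The signal interrupted the driver mid-query, so the connection's
        # protocol state can't be trusted any more; drop it immediately so the
        # cleanup code and the next task get a fresh connection.
        self.db.close()
        raise

db_utils.CursorWrapper.execute = _execute_closing_on_soft_timeout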

Apologies for the late response. I don’t have much time to allocate towards this anymore, but since my change was shipped and there’s a complaint about a performance regression, I figured I’d respond.

That’s effectively what reverting the PR does. At task finish the connection will be closed every time. How could this cause idle workers to get into a bad state?

I’m sort of putting that back in Django’s court. Should it be expected that downstreams (that aren’t oriented towards HTTP use cases) can piggyback off the Django connection lifecycle settings?

This does sound like a better solution.

The purpose of a soft limit (versus the hard limit) in Celery is that the task can exit gracefully and clean up after itself. From the Celery docs: “The time limit is set in two values, soft and hard. The soft time limit allows the task to catch an exception to clean up before it is killed: the hard timeout isn’t catch-able and force terminates the task.”

If you keep the connection open, then you have not cleaned up, as the query will still be in progress and using resources on your database. I assume that’s why you get the error django.db.utils.OperationalError: sending query failed: another command is already in progress, since the query (SELECT pg_sleep) is still running.
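
So a task that wants to clean up gracefully could do something like this (illustrative task name; note that closing the client connection does not necessarily stop the server-side query immediately):

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded
from django.db import connection

@shared_task(soft_time_limit=2, time_limit=10)
def slow_report():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT pg_sleep(5);")
    except SoftTimeLimitExceeded:
        # Drop the half-finished connection so any cleanup queries (and the
        # next task on this worker) don't trip over "another command is
        # already in progress".
        connection.close()
        raise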
