I recently came across this on a production Celery worker that uses the soft_time_limit feature. Celery uses the Python signal module to hijack control of a task and raise an exception to prevent runaway tasks.
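For context, the kind of task involved looks roughly like this (a sketch; the task body and the limit value are made up, soft_time_limit is the real Celery task option):

from celery import shared_task

from django.db import connection


@shared_task(soft_time_limit=2)
def slow_task():
    with connection.cursor() as cursor:
        # Celery's worker delivers a signal when the soft limit expires; the
        # handler raises SoftTimeLimitExceeded inside this call, mid-query.
        cursor.execute("SELECT pg_sleep(5);")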
This can be reproduced without Celery though:
Create a view at / in a Django application that uses a PostgreSQL database with CONN_MAX_AGE configured to be greater than the duration of the experiment (say ~60):
import signal
import threading

from django.db import connection


def interruption_handler(*args):
    raise Exception("Interruption")


signal.signal(signal.SIGALRM, interruption_handler)


def test_view(request):
    def hijack_control_flow():
        signal.alarm(2)

    thread = threading.Thread(target=hijack_control_flow)
    thread.start()

    with connection.cursor() as cursor:
        # this query will be interrupted by interruption_handler
        cursor.execute("SELECT pg_sleep(5);")
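For completeness, the CONN_MAX_AGE requirement corresponds to database settings along these lines (database name and credentials are placeholders):

# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "testapp",
        "USER": "testapp",
        "PASSWORD": "secret",
        "HOST": "localhost",
        # Longer than the experiment, so the connection persists between requests.
        "CONN_MAX_AGE": 60,
    }
}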
Run the application using gunicorn --error-logfile /dev/stdout -w1 --bind localhost:8000 testapp.wsgi
Run curl http://localhost:8000 and note the regular Exception: Interruption
Run curl http://localhost:8000 again and note django.db.utils.OperationalError: sending query failed: another command is already in progress
The second request will never be able to use the database because it’s attempting to use an unusable connection.
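For what it's worth, the connection does report itself as broken if asked; a check along these lines would detect it (a sketch, the helper is hypothetical, is_usable() and close() are real methods on the connection wrapper):

from django.db import connection


def close_connection_if_broken():
    # is_usable() runs a trivial query on the underlying connection; with the
    # interrupted query still in progress it fails, so close the connection
    # and let the next cursor() call open a fresh one.
    if connection.connection is not None and not connection.is_usable():
        connection.close()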
I see 3 paths forward:
1. Keep this as wontfix and have downstreams deal with it, perhaps pushing a PR to Celery to stop trying to reuse connections between tasks.
2. Modify close_if_unusable_or_obsolete to take an additional argument to more rigorously check for usability.
Option 2 seems too expensive for such an edge case, and it would possibly still allow the bug to exist in the case of a database failover, as reported in #30646. I'm leaning towards the first path. Does anyone have additional thoughts?
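As a sketch of what the first path could look like on the Celery side (the signal and exception below are real Celery APIs; the handler itself is only an illustration, not a proposed patch):

from celery.exceptions import SoftTimeLimitExceeded
from celery.signals import task_failure

from django.db import connections


@task_failure.connect
def close_connections_on_soft_timeout(sender=None, exception=None, **kwargs):
    # A task interrupted mid-query leaves the connection in a state Django
    # doesn't notice, so close everything rather than trying to reuse it.
    if isinstance(exception, SoftTimeLimitExceeded):
        for conn in connections.all():
            conn.close()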
FWIW, we ran into a similar issue with MySQL at work, and we ended up monkey-patching the cursor wrapper so that the SoftTimeLimitExceeded handling immediately closes the connection, since the C and Python layers got out of sync when a signal was raised mid query execution.
We felt this was a better approach because the task might still have to perform cleanup work involving the database in the window between the soft time limit and the hard time limit; if the connection is unusable but not marked accordingly, it will crash even before the next task is scheduled.
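The shape of that patch is roughly the following (a sketch rather than our exact code; hooking CursorWrapper.execute is an assumption about where the wrapping happens):

from celery.exceptions import SoftTimeLimitExceeded
from django.db.backends import utils as db_utils

_original_execute = db_utils.CursorWrapper.execute


def execute(self, sql, params=None):
    try:
        return _original_execute(self, sql, params)
    except SoftTimeLimitExceeded:
        # The signal interrupted the driver mid-query, so the connection can
        # no longer be trusted; close it before letting the task clean up.
        self.db.close()
        raise


db_utils.CursorWrapper.execute = execute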
Apologies for the late response. I don’t have much time to allocate towards this anymore, but since my change was shipped and there’s a complaint about a performance regression I figured I’d respond.
That’s effectively what reverting the PR does: at task finish the connection will be closed every time. How could this cause idle workers to get into a bad state?
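For concreteness, that reverted behaviour, expressed as user code rather than Celery's actual fixup, is roughly:

from celery.signals import task_postrun

from django.db import connections


@task_postrun.connect
def close_database_connections(**kwargs):
    # Close every open connection when a task finishes, instead of keeping it
    # around for CONN_MAX_AGE-based reuse by the next task.
    for conn in connections.all():
        conn.close()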
I’m sort of putting that back in Django’s court. Should it be expected that downstreams (that aren’t oriented towards HTTP use cases) can piggyback off the Django connection lifecycle settings?
The purpose of a soft limit (versus the hard limit) in Celery is that the task can exit gracefully and clean up after itself. From the Celery docs: “The time limit is set in two values, soft and hard. The soft time limit allows the task to catch an exception to clean up before it is killed: the hard timeout isn’t catch-able and force terminates the task.”
If you keep the connection open, then you have not cleaned up, as the query will still be in progress and using resources on your database. I assume that’s why you get the error “django.db.utils.OperationalError: sending query failed: another command is already in progress”: the query (SELECT pg_sleep) is still running.
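Concretely, that cleanup could mean the task closes its own connection when the soft limit fires, along these lines (a sketch; the task body is made up):

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

from django.db import connection


@shared_task(soft_time_limit=30)
def generate_report():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT pg_sleep(60);")  # stands in for a slow query
    except SoftTimeLimitExceeded:
        # The server-side query may still be running and the client connection
        # was interrupted mid-protocol, so don't try to reuse it.
        connection.close()
        raise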