Upgraded Django from 3.2.22 to 4.2.16, and asgiref from 3.5.2 to 3.8.1.
Phenomenon:
The execution method of the interfaces has changed from serial to parallel.
When a certain interface times out (nginx is set to 60 seconds), approximately 10 seconds later, all subsequent requests will be blocked until that interface completes execution.
In this situation, if async_to_sync is used within the interface, it will cause the interface to block, which in turn blocks other requests, ultimately leading to nginx returning a 504 timeout error. When the number of accumulated requests reaches 51, nginx will return a 502 error.
This phenomenon may be related to the upgrades of Django and asgiref, and it is recommended to optimize the code to avoid blocking issues.
Versions: nginx + daphne 4.1.2 + asgiref 3.8.1 + django 5.1.2
In our environment, we upgraded the packages from asgiref 3.5.2 and Django 3.2.22 to asgiref 3.8.1 and Django 5.1.2. I started the Django service using the command python manage.py runserver 127.0.0.1:8002. The frontend invokes an interface (referred to as A) through the nginx reverse proxy. After the request times out, nginx returns a 504 error and drops the connection. Approximately 10 seconds later, a message stating “took too long to shut down and was killed” is displayed. At this point, the main thread is executing asgiref.sync.ThreadSensitiveContext.__aexit__, and it becomes blocked at threading.Thread._wait_for_tstate_lock, waiting for lock.acquire. When I attempt to invoke other interfaces during this time, they also become blocked.
Simultaneously, interface A executes asgiref.sync.async_to_sync, causing the code self._work_queue.get() in asgiref.current_thread_executor.CurrentThreadExecutor.run_until_future to block. This occurs because the main thread cannot send the Future object to self._work_queue. As a result, the execution of interface A cannot complete, and the main thread remains blocked until interface A finishes. This situation leads to a deadlock, preventing all subsequent requests from being processed.
Could you please help me review the code and suggest how to resolve the deadlock issue described above?
I found that deleting async with ThreadSensitiveContext() from django.core.handlers.asgi.ASGIHandler.__call__ resolves the problem. However, I am concerned about any potential negative impacts of this change.
It looks like the cancellation is not making it to the interface_a future. I need to experiment to see if I can reproduce that, and dig into why.
Immediately, I’d suggest trying to wrap the long-running time.sleep(80) part in a timeout that you know is shorter than the nginx timeout, and observing the behaviour then.
cancel(): If the call is currently being executed… then the method will return False, otherwise the call will be cancelled and the method will return True.
So, your interface_a method isn’t being cancelled (as per the appearance).
That doesn’t quite explain why the async_to_sync() call blocks (quite possibly because it’s not scheduled on the event loop, with the parent task having been cancelled) but it’s going to be essential that you add some control to complete the thread task in case it takes too long.
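The cancel() behaviour quoted above can be checked with the standard library alone. This is a minimal sketch, not a reproduction of the Django/asgiref code path: slow_call and cancel_results are hypothetical names, and a one-worker ThreadPoolExecutor stands in for the thread that runs the sync view.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def cancel_results():
    """Return (cancel() on a running call, cancel() on a queued call, result of the running call)."""
    started = threading.Event()
    release = threading.Event()

    def slow_call():
        started.set()            # signal that the worker thread has begun
        release.wait(timeout=5)  # stand-in for long blocking work
        return "done"

    with ThreadPoolExecutor(max_workers=1) as pool:
        running = pool.submit(slow_call)
        started.wait(timeout=5)          # make sure slow_call is executing
        queued = pool.submit(slow_call)  # the only worker is busy, so this stays pending
        running_cancelled = running.cancel()  # already executing: cannot be cancelled
        queued_cancelled = queued.cancel()    # still pending: can be cancelled
        release.set()                    # let the running call finish
    return running_cancelled, queued_cancelled, running.result()
```

The running call ignores the cancel request and runs to completion, which matches what is observed with interface_a: once the sync work has started in its thread, nothing outside that thread stops it.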
Hi, @carltongibson, Thanks for your suggestion,
I have read part of the Django source code and discovered through debugging that when the time.sleep in interface A exceeds the nginx timeout + 10 seconds, the main thread is blocked at lock.acquire() until interface A completes execution. The following is part of the call stack:
I still need to experiment to look into the details there. (If you provided a running minimal reproduce, that would save a cycle recreating one from the description.)
I suspect the issue is caused by the fact the slow sync task is taking longer than the nginx timeout, so the executor is cancelling before the async task can begin. I suggested ensuring the timeout there is shorter than the nginx timeout as a potential workaround for the moment.
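One way to keep sync work inside a budget shorter than the nginx timeout is a cooperative deadline check. The sketch below assumes the long task can be split into smaller steps; run_with_deadline and DeadlineExceeded are hypothetical names, not Django or asgiref APIs. A single blocking call such as time.sleep(80) cannot be interrupted this way and needs its own timeout argument instead.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the work did not finish inside its time budget."""


def run_with_deadline(steps, budget_seconds, clock=time.monotonic):
    """Run each callable in `steps`, refusing to start one after the deadline."""
    deadline = clock() + budget_seconds
    results = []
    for index, step in enumerate(steps):
        if clock() >= deadline:
            # Stop before starting more work; the caller can return an error response.
            raise DeadlineExceeded(f"budget used up before step {index}")
        results.append(step())
    return results
```

With nginx set to 60 seconds, a view might pass a budget of, say, 50 seconds and turn DeadlineExceeded into an error response, so the thread is released before the proxy gives up.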
Hi, @carltongibson, thank you for your patient response. Is there any difficulty in reproducing this issue?
I found that deleting async with ThreadSensitiveContext() from django.core.handlers.asgi.ASGIHandler.__call__ resolves the problem. Serial processing between interfaces is not an issue for our system. Are there any potential negative impacts of this change?
We would like to adopt your previous suggestion, “I’d suggest trying to wrap the long-running time.sleep(80) part in a timeout that you know is shorter than the nginx timeout”. I tried it in one interface and it works. We would need to change all of our interfaces, though. Does Django have a general configuration for this feature?
There’s nothing built-in to Django for timeouts here. It would depend on what the long running task is, timeouts being quite specific to the code you’re running.
Most libraries will provide a timeout mechanism where that makes sense. As an example, HTTPX has timeout handling.
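To stay dependency-free, the sketch below does not use HTTPX; it shows the same idea for an awaitable using asyncio.wait_for from the standard library. fetch_slowly is a hypothetical stand-in for a call to an external service, and returning None on timeout is just one possible policy.

```python
import asyncio


async def fetch_slowly(delay):
    """Hypothetical stand-in for an awaitable call to an external service."""
    await asyncio.sleep(delay)
    return "payload"


async def fetch_with_timeout(delay, timeout):
    """Return the payload, or None if the call does not finish within `timeout` seconds."""
    try:
        return await asyncio.wait_for(fetch_slowly(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising, so nothing keeps running.
        return None
```

Unlike the thread case, cancelling a coroutine at an await point does take effect, which is why a timeout on the async side is reliable.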
Hi @carltongibson, I am considering adopting this solution, but there might be security risks. If someone exploits this feature to launch an attack, it could cause the system to fail. Is there a good way to prevent such security issues?
Rate limiting would be your usual first port of call. (There are a few ecosystem packages for this if you search.)
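To make the idea concrete, here is a minimal in-memory sliding-window limiter. It is only a sketch under stated assumptions: SlidingWindowLimiter is a hypothetical class, the state lives in a single process, and it is not thread-safe, so a real deployment would more likely rely on one of the ecosystem packages or on rate limiting at the proxy.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` per `window_seconds` for each key (e.g. a client IP)."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of accepted calls

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Forget calls that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_calls:
            return False
        hits.append(now)
        return True
```

A view or middleware would call allow() with a per-client key and reject the request when it returns False, which caps how many slow requests one caller can pile up.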
Adding timeouts to calls to external services would (in general) make your application more robust, and given the bug you hit here, it would definitely seem to do that for you in this case.
I’m not sure what else I can say at this level of detail. Good luck! Have fun!