Using asgiref 3.8.1 and Django 4.2.16 results in a deadlock

Background:

Upgraded Django from 3.2.22 to 4.2.16, and asgiref from 3.5.2 to 3.8.1.

Phenomenon:

  1. Request handling for our interfaces changed from serial to parallel.
  2. When an interface exceeds the nginx timeout (set to 60 seconds), then approximately 10 seconds later all subsequent requests are blocked until that interface finishes executing.
  3. In that situation, if the interface also calls async_to_sync, the call blocks, which in turn blocks other requests, ultimately leading to nginx returning 504 timeout errors; once 51 requests have accumulated, nginx starts returning 502 errors.

This phenomenon may be related to the upgrades of Django and asgiref, and it is recommended to optimize the code to avoid blocking issues.
Versions: nginx + daphne 4.1.2 + asgiref 3.8.1 + Django 5.1.2

In our environment, we upgraded from asgiref 3.5.2 and Django 3.2.22 to asgiref 3.8.1 and Django 5.1.2. I started the Django service using the command python manage.py runserver 127.0.0.1:8002. The frontend invokes an interface (referred to as A) through the nginx reverse proxy. After the request times out, nginx returns a 504 error and interrupts the request. Approximately 10 seconds later, a message stating “took too long to shut down and was killed” is displayed. At this point the main thread is executing asgiref.sync.ThreadSensitiveContext.__aexit__ and is blocked in threading.Thread._wait_for_tstate_lock, waiting on lock.acquire. When I attempt to invoke other interfaces during this time, they also become blocked.

Meanwhile, interface A calls asgiref.sync.async_to_sync, and self._work_queue.get() in asgiref.current_thread_executor.CurrentThreadExecutor.run_until_future blocks, because the main thread cannot post the Future object to self._work_queue. As a result, interface A cannot complete, while the main thread keeps waiting for interface A to finish. This is a deadlock, and it prevents all subsequent requests from being processed.

Here is a demo of interface A:

from asgiref.sync import async_to_sync
from django.http import HttpResponse


def interface_a(request):
    # Long-running synchronous work
    cost_long_time_execute()
    # This call triggers the deadlock
    async_to_sync(execute_async)()
    return HttpResponse(content='{"status": "ok"}')

Could you please help me review the code and suggest how to resolve the deadlock issue described above?

I found that deleting async with ThreadSensitiveContext() from django.core.handlers.asgi.ASGIHandler.__call__ resolves the problem. However, I am concerned about any potential negative impacts of this change.
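
For reference, rather than editing the installed Django source in place, the same experiment can be expressed as a small subclass in our own asgi.py. This is only a sketch of what I tried, assuming ASGIHandler keeps the handle() method and scope-type check of recent Django versions:

# asgi.py -- experimental handler that skips ThreadSensitiveContext
from django.core.handlers.asgi import ASGIHandler


class NoThreadSensitiveASGIHandler(ASGIHandler):
    async def __call__(self, scope, receive, send):
        # Upstream Django only serves HTTP connections here.
        if scope["type"] != "http":
            raise ValueError(
                "Django can only handle ASGI/HTTP connections, not %s." % scope["type"]
            )
        # Upstream wraps the next call in `async with ThreadSensitiveContext():`;
        # skipping that wrapper is exactly the change described above.
        await self.handle(scope, receive, send)

Daphne is then pointed at application = NoThreadSensitiveASGIHandler(), constructed after django.setup(set_prefix=False), mirroring what get_asgi_application does.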

Hi @luxiaoyong.

Hmmm. Good description, but it’s difficult to say too much just from what’s there.

First, can you try with asgiref==3.7.2 and see if that helps?

Then, if you can reduce to a minimal example that recreates the behaviour, it’s possible to say a bit more.

Thank you very much for your attention. Here are the details:

  1. Nginx is configured with a timeout of 60 seconds, and the Django service is started using the command python manage.py runserver 127.0.0.1:8002.
  2. The interface code is as follows (a URL wiring sketch follows this list):
    import asyncio
    import time

    import asgiref.sync
    from django.http import HttpResponse


    async def async_sleep():
        await asyncio.sleep(5)

    def interface_a(request):
        time.sleep(80)  # exceeds the nginx timeout (60 s) plus the ~10 s grace period
        # This call triggers the deadlock
        asgiref.sync.async_to_sync(async_sleep)()
        return HttpResponse(content='{"status": "ok"}')
    
  3. When the above interface is called, nginx returns a 504 error after 60 seconds.
  4. After an additional 10 seconds, calls to other interfaces become blocked and cannot recover.
  5. We tested with asgiref version 3.7.2, which did not result in a deadlock. However, the main thread remains blocked, which is the critical issue.
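
For completeness, a minimal URL wiring such as the following is enough to expose the view (the myapp.views module path is just a placeholder):

# urls.py -- minimal routing for the reproduction view above
from django.urls import path

from myapp.views import interface_a  # placeholder module path

urlpatterns = [
    path("interface-a/", interface_a),
]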

Thanks @luxiaoyong.

It looks like the cancellation is not making it to the interface_a future. I need to experiment to see if I can reproduce that, and dig into why.

Immediately, I’d suggest wrapping the long-running time.sleep(80) part in a timeout that you know is shorter than the nginx timeout, and observing the behaviour then.

From the Python docs:

cancel(): If the call is currently being executed… then the method will return False, otherwise the call will be cancelled and the method will return True.

So, your interface_a method isn’t being cancelled (by all appearances).

That doesn’t quite explain why the async_to_sync() call blocks (quite possibly because it’s not scheduled on the event loop, the parent task having been cancelled), but it’s going to be essential that you add some control to complete the thread task in case it takes too long.
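
As a rough sketch of what I mean, reusing the names from your snippets, with the 50 s budget as a placeholder you’d tune to sit below nginx’s 60 s:

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

from asgiref.sync import async_to_sync
from django.http import HttpResponse


def interface_a(request):
    executor = ThreadPoolExecutor(max_workers=1)
    # Run the slow, blocking work with a budget below nginx's proxy timeout.
    future = executor.submit(cost_long_time_execute)
    try:
        future.result(timeout=50)
    except FutureTimeoutError:
        # Give up on this request. Note the worker thread keeps running
        # until cost_long_time_execute() actually returns.
        executor.shutdown(wait=False)
        return HttpResponse(status=504, content='{"status": "timeout"}')
    executor.shutdown(wait=False)

    # Only reach the async bridge if the slow work finished in time.
    async_to_sync(execute_async)()
    return HttpResponse(content='{"status": "ok"}')

That only protects the request itself; if the underlying work can be given its own timeout (a database statement timeout, an HTTP client timeout, and so on), that’s better still.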

Hi @carltongibson, thanks for your suggestion.
I have read part of the Django source code and, through debugging, discovered that when the time.sleep in interface A exceeds the nginx timeout plus 10 seconds, the main thread is blocked at lock.acquire() until interface A completes execution. The following is part of the call stack:

[call stack screenshot]
Is this a potential issue with Django?

The issue is this one from your initial post:

I still need to experiment to look into the details there. (If you provided a running minimal reproduce, that would save a cycle recreating one from the description.)

I suspect the issue is caused by the fact that the slow sync task takes longer than the nginx timeout, so the executor cancels before the async task can begin. I suggested ensuring the timeout there is shorter than the nginx timeout as a potential workaround for the moment.

Hi @carltongibson, thank you for your patient response. Has there been any difficulty in reproducing this issue?
I found that deleting async with ThreadSensitiveContext() from django.core.handlers.asgi.ASGIHandler.__call__ resolves the problem.
Serial processing between interfaces is not an issue for our system. Are there any potential negative impacts of this change?

Hi @luxiaoyong

No problem. I haven’t had a chance to dig into it yet. (Alas, WORK :sweat_smile:)

If you happened to have a minimal reproduction project, that would save me a cycle, but it looks like there’s sufficient information here.

(Just TIME :mantelpiece_clock:)

As long as your code isn’t in fact thread-sensitive, you should be OK — but without looking more closely I simply cannot say conclusively.
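
“Thread-sensitive” here means sync code that assumes it stays on one thread, the classic case being the ORM, whose database connections are per-thread. A sketch of the kind of code ThreadSensitiveContext protects (Book is a stand-in model, purely for illustration):

from asgiref.sync import sync_to_async
from django.http import JsonResponse

from myapp.models import Book  # stand-in model, purely for illustration


async def book_count(request):
    # thread_sensitive=True (the default) runs the ORM call on the single
    # thread-sensitive thread for this request, so it reuses the same
    # per-thread database connection as any other sync code in the request.
    count = await sync_to_async(Book.objects.count, thread_sensitive=True)()
    return JsonResponse({"count": count})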

We would like to adopt your previous suggestion: “I’d suggest trying to wrap the long-running time.sleep(80) part in a timeout that you know is shorter than the nginx timeout”. I tried it in one interface and it works, but we would need to change every interface. Does Django have a general configuration for this?

@luxiaoyong I’m glad the suggestion works.

There’s nothing built into Django for timeouts here. It depends on what the long-running task is; timeouts are quite specific to the code you’re running.

Most libraries provide a timeout mechanism where that makes sense. As an example, HTTPX has timeout handling.
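
For instance, something like this (the timeout values are illustrative and the URL is made up):

import httpx

# 10 s total budget per request, with a tighter 5 s connect budget.
client = httpx.Client(timeout=httpx.Timeout(10.0, connect=5.0))

try:
    response = client.get("https://slow-upstream.example.com/report")
except httpx.TimeoutException:
    response = None  # handle the timeout instead of hanging the worker thread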

Hi @carltongibson, I am considering adopting this solution, but there might be security risks: if an attacker deliberately triggers many slow requests, the system could still fail. Is there a good way to prevent such issues?

Rate limiting would be your usual first port of call. (There are a few ecosystem packages for this if you search.)
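
To give a flavour, a very rough per-IP counter on top of Django’s cache framework looks like the sketch below; the ecosystem packages do the same thing properly, with atomic increments, better keys, and configurable responses:

from functools import wraps

from django.core.cache import cache
from django.http import HttpResponse

LIMIT = 30    # max requests per client IP...
WINDOW = 60   # ...per 60-second window


def rate_limited(view):
    @wraps(view)
    def wrapper(request, *args, **kwargs):
        key = "rl:%s" % request.META.get("REMOTE_ADDR", "unknown")
        if cache.get_or_set(key, 0, timeout=WINDOW) >= LIMIT:
            return HttpResponse(status=429, content='{"status": "rate limited"}')
        try:
            cache.incr(key)
        except ValueError:
            # The key expired between the check and the increment; start over.
            cache.set(key, 1, timeout=WINDOW)
        return view(request, *args, **kwargs)

    return wrapper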

Adding timeouts to calls to external services would (in general) make your application more robust, and given the bug you hit here it would seem to definitely do that for you in this case.

I’m not sure what else I can say at this level of detail. Good luck! Have fun!